This example uses the tidyverse suite of packages.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.90 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggridges)
## Warning: package 'ggridges' was built under R version 4.1.2
There are multiple data sets associated with this project. You will start with a small, simplified design. This will allow you to get practice fitting models, selecting the best model, and making predictions. You will demonstrate selecting optimal input configurations with this simplified design before tackling the more complicated larger problem.
The simplified data set is read in below. It is assumed that this markdown file is located in the same directory as the data. If you want to run this markdown yourself, download the data sets from Canvas and place them in the same directory as this .Rmd file. It is highly recommended that you work within an RStudio Project when working on the final project.
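Since a missing data file is a common stumbling block, a small defensive wrapper around `readr::read_csv()` can give a clearer error message than readr's own. This is only a sketch; the helper name `read_project_csv()` is hypothetical, not part of the project template.

```r
library(readr)

# Hypothetical helper: read a project CSV only if it is actually present,
# otherwise stop with an actionable message instead of readr's own error.
read_project_csv <- function(path) {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "'. Download the data from Canvas and place ",
         "it in the same directory as this .Rmd file.", call. = FALSE)
  }
  readr::read_csv(path, col_names = TRUE, show_col_types = FALSE)
}
```

Usage would then be `df_start <- read_project_csv('small_train_data.csv')`.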
df_start <- readr::read_csv('small_train_data.csv', col_names = TRUE)
## Rows: 125 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (6): x07, x09, x10, x11, x21, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#There are 125 observations and 5 inputs, with the target variable named response, in the above small data set.
The simplified design consists of 5 inputs, x07, x09, x10, x11, and x21, and one continuous output, response. The input variable names are consistent with the larger data set, which is why their numbering does not start at x01. A glimpse of the data is given below.
df_start %>% glimpse()
## Rows: 125
## Columns: 6
## $ x07 <dbl> 0.11177674, 0.25100882, 0.65946235, 0.94918516, 0.66804925, 0~
## $ x09 <dbl> 0.56766604, 0.04336232, 0.66390665, 0.65207424, 0.78146069, 0~
## $ x10 <dbl> 0.396381547, 0.618260905, 0.048903943, 0.269819648, 0.2235553~
## $ x11 <dbl> 0.58387771, 0.53653216, 0.48976317, 0.30652758, 0.11211528, 0~
## $ x21 <dbl> 0.810788181, 0.736638473, 0.369762516, 0.926768468, 0.7161686~
## $ response <dbl> -0.292184433, 0.693360200, -0.249312839, 0.378085101, 0.74714~
##EXPLORATORY DATA ANALYSIS OF SMALL DATA
#Check for missing values
visdat::vis_miss(df_start)
#There are no missing values in any of the variables
visdat::vis_dat(df_start)
#All columns are numeric, so a linear regression model can be applied without any categorical encoding
df_start %>% purrr::map_dbl(n_distinct)
## x07 x09 x10 x11 x21 response
## 125 125 125 125 125 125
#Each of the 6 columns has 125 distinct values, so every observation is unique
df_start %>% select(-response) %>% summary()
## x07 x09 x10 x11
## Min. :0.0007088 Min. :0.007776 Min. :0.005096 Min. :0.001213
## 1st Qu.:0.2510088 1st Qu.:0.248175 1st Qu.:0.251163 1st Qu.:0.250142
## Median :0.5034108 Median :0.502097 Median :0.501573 Median :0.498635
## Mean :0.5003370 Mean :0.499948 Mean :0.499693 Mean :0.500270
## 3rd Qu.:0.7480832 3rd Qu.:0.749657 3rd Qu.:0.746016 3rd Qu.:0.748200
## Max. :0.9972816 Max. :0.994879 Max. :0.993792 Max. :0.999096
## x21
## Min. :0.004908
## 1st Qu.:0.251978
## Median :0.502134
## Mean :0.499890
## 3rd Qu.:0.747096
## Max. :0.993307
df_start %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid")) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 35) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(axis.text.y = element_blank())
#As seen above, the mean and median of each of the five input distributions are approximately equal, indicating an even spread of values across each input's range; no transformations appear necessary. The response, however, looks roughly Gaussian with some left skew.
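The mean-versus-median comparison can be checked numerically rather than by eye. A small sketch (the helper name `mean_median_gap()` is an invention for illustration):

```r
library(dplyr)
library(tidyr)

# For each numeric column, report mean, median, and their gap.
# A gap near zero is consistent with a symmetric (e.g., uniform) distribution.
mean_median_gap <- function(df) {
  df %>%
    pivot_longer(where(is.numeric)) %>%
    group_by(name) %>%
    summarise(mean = mean(value),
              median = median(value),
              gap = mean - median,
              .groups = "drop")
}
```

Applied as `mean_median_gap(df_start)`, the gaps for the five inputs should all be close to zero, matching the summary above.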
df_start %>% skimr::skim()
| Name | Piped data |
| Number of rows | 125 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| x07 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| x09 | 0 | 1 | 0.50 | 0.29 | 0.01 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| x10 | 0 | 1 | 0.50 | 0.29 | 0.01 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| x11 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| x21 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| response | 0 | 1 | 0.05 | 0.78 | -1.74 | -0.55 | 0.37 | 0.64 | 1.31 | ▂▃▃▇▅ |
df_start %>%
tibble::rowid_to_column() %>%
pivot_longer(cols = !c('rowid',"response")) %>%
ggplot(mapping = aes(x=value, y=response))+
geom_point()+
facet_wrap(~name,scales = "free_x")
df_start %>% select(-response) %>%
GGally::ggpairs(progress = FALSE, diag = list(continuous = GGally::wrap('barDiag',bins=25))) +
theme_bw()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
df_start %>% cor() %>%
corrplot(method="color")
#The figure above indicates no significant correlation between the input variables; note also that each input's distribution looks uniform rather than Gaussian
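To back the "no significant correlation" claim with a number, the largest absolute off-diagonal correlation can be extracted directly (`max_abs_cor()` is a hypothetical helper, not part of corrplot):

```r
# Largest absolute pairwise correlation among the columns, ignoring the
# diagonal (each variable's correlation with itself).
max_abs_cor <- function(df) {
  cm <- cor(df)
  diag(cm) <- NA
  max(abs(cm), na.rm = TRUE)
}
```

For the simplified design, `max_abs_cor(df_start %>% select(-response))` should return a small value, consistent with the mostly white correlation plot.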
##EXPLORATORY DATA ANALYSIS OF LARGE DATA, X-VARIABLE
train_x <- readr::read_csv("train_input_set_x.csv", col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_x %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, ~
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.452833469~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.797549~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0.7~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0.7~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0.1~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.039929672~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.699955~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.707366~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.0~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.4~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0.2~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0.6~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0.3~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0.9~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0.9~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0.3~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.671517414~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0.7~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0.2~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.956529087~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0.1~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0.7~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.788406~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0.5~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.422873248~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0.7~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0.7~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0.1~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0.6~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.262299~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0.5~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0.4~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0.3~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.011386~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.317191375~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0.9~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.613027684~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0.2~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.4~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.063404346~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.758328~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.217674308~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0.5~
train_outputs <- readr::read_csv("train_outputs.csv", col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
ready_x_A <- train_x %>%
left_join(train_outputs, by = 'run_id') %>%
select(-run_id, -outcome)
ready_x_A %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.4528334~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.7975~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.0399296~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.6999~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.7073~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.6715174~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.9565290~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.7884~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.4228732~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.2622~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.0113~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.3171913~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.6130276~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.0634043~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.7583~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.2176743~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0~
## $ response <dbl> 0.74119845, 0.33409119, 0.53881939, 0.06878022, 0.46564607, 0~
train_x %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid")) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 35) +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(axis.text.y = element_blank())
train_x %>% skimr::skim()
| Name | Piped data |
| Number of rows | 1899 |
| Number of columns | 44 |
| _______________________ | |
| Column type frequency: | |
| numeric | 44 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| run_id | 0 | 1 | 950.00 | 548.34 | 1 | 475.50 | 950.00 | 1424.50 | 1899 | ▇▇▇▇▇ |
| x01 | 0 | 1 | 0.51 | 0.31 | 0 | 0.24 | 0.52 | 0.80 | 1 | ▇▆▆▆▇ |
| x02 | 0 | 1 | 0.50 | 0.30 | 0 | 0.22 | 0.49 | 0.77 | 1 | ▇▆▆▆▇ |
| x03 | 0 | 1 | 0.49 | 0.31 | 0 | 0.21 | 0.48 | 0.77 | 1 | ▇▆▅▆▇ |
| x04 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.48 | 0.78 | 1 | ▇▇▆▆▇ |
| x05 | 0 | 1 | 0.47 | 0.30 | 0 | 0.22 | 0.43 | 0.73 | 1 | ▇▇▆▆▆ |
| x06 | 0 | 1 | 0.49 | 0.30 | 0 | 0.23 | 0.46 | 0.76 | 1 | ▇▇▆▆▇ |
| x07 | 0 | 1 | 0.49 | 0.29 | 0 | 0.24 | 0.49 | 0.75 | 1 | ▇▇▇▇▇ |
| x08 | 0 | 1 | 0.48 | 0.30 | 0 | 0.22 | 0.44 | 0.75 | 1 | ▇▇▆▆▇ |
| x09 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.50 | 0.79 | 1 | ▇▆▆▆▇ |
| x10 | 0 | 1 | 0.50 | 0.31 | 0 | 0.20 | 0.50 | 0.79 | 1 | ▇▆▆▆▇ |
| x11 | 0 | 1 | 0.51 | 0.30 | 0 | 0.24 | 0.52 | 0.78 | 1 | ▇▇▆▇▇ |
| x12 | 0 | 1 | 0.50 | 0.31 | 0 | 0.21 | 0.48 | 0.79 | 1 | ▇▆▆▆▇ |
| x13 | 0 | 1 | 0.51 | 0.31 | 0 | 0.24 | 0.51 | 0.80 | 1 | ▇▆▆▆▇ |
| x14 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.49 | 0.78 | 1 | ▇▆▆▆▇ |
| x15 | 0 | 1 | 0.52 | 0.31 | 0 | 0.23 | 0.53 | 0.80 | 1 | ▇▅▆▆▇ |
| x16 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.49 | 0.77 | 1 | ▇▆▆▆▇ |
| x17 | 0 | 1 | 0.50 | 0.30 | 0 | 0.24 | 0.49 | 0.78 | 1 | ▇▇▆▆▇ |
| x18 | 0 | 1 | 0.56 | 0.29 | 0 | 0.31 | 0.67 | 0.78 | 1 | ▃▃▃▇▅ |
| x19 | 0 | 1 | 0.56 | 0.29 | 0 | 0.31 | 0.65 | 0.78 | 1 | ▅▃▃▇▅ |
| x20 | 0 | 1 | 0.57 | 0.28 | 0 | 0.33 | 0.67 | 0.78 | 1 | ▃▃▃▇▅ |
| x21 | 0 | 1 | 0.57 | 0.29 | 0 | 0.33 | 0.68 | 0.79 | 1 | ▃▃▃▇▅ |
| x22 | 0 | 1 | 0.57 | 0.28 | 0 | 0.34 | 0.68 | 0.78 | 1 | ▃▃▃▇▅ |
| x23 | 0 | 1 | 0.57 | 0.29 | 0 | 0.32 | 0.66 | 0.79 | 1 | ▃▃▃▇▅ |
| x24 | 0 | 1 | 0.50 | 0.27 | 0 | 0.29 | 0.49 | 0.71 | 1 | ▆▇▇▇▆ |
| x25 | 0 | 1 | 0.50 | 0.27 | 0 | 0.29 | 0.50 | 0.72 | 1 | ▆▇▇▇▆ |
| x26 | 0 | 1 | 0.50 | 0.28 | 0 | 0.28 | 0.50 | 0.70 | 1 | ▆▆▇▅▆ |
| x27 | 0 | 1 | 0.50 | 0.28 | 0 | 0.28 | 0.50 | 0.72 | 1 | ▆▆▇▆▆ |
| x28 | 0 | 1 | 0.49 | 0.30 | 0 | 0.22 | 0.49 | 0.77 | 1 | ▇▇▆▆▇ |
| x29 | 0 | 1 | 0.50 | 0.31 | 0 | 0.23 | 0.51 | 0.78 | 1 | ▇▆▆▆▇ |
| x30 | 0 | 1 | 0.58 | 0.29 | 0 | 0.33 | 0.62 | 0.83 | 1 | ▅▅▅▆▇ |
| x31 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.49 | 0.79 | 1 | ▇▆▆▆▇ |
| x32 | 0 | 1 | 0.50 | 0.31 | 0 | 0.21 | 0.50 | 0.78 | 1 | ▇▆▆▆▇ |
| x33 | 0 | 1 | 0.50 | 0.31 | 0 | 0.23 | 0.50 | 0.79 | 1 | ▇▆▆▆▇ |
| x34 | 0 | 1 | 0.50 | 0.31 | 0 | 0.23 | 0.49 | 0.79 | 1 | ▇▆▆▆▇ |
| x35 | 0 | 1 | 0.50 | 0.30 | 0 | 0.22 | 0.50 | 0.78 | 1 | ▇▆▆▇▇ |
| x36 | 0 | 1 | 0.49 | 0.31 | 0 | 0.21 | 0.50 | 0.77 | 1 | ▇▆▆▆▇ |
| x37 | 0 | 1 | 0.49 | 0.31 | 0 | 0.21 | 0.49 | 0.78 | 1 | ▇▆▆▆▇ |
| x38 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.50 | 0.79 | 1 | ▇▆▆▆▇ |
| x39 | 0 | 1 | 0.50 | 0.31 | 0 | 0.21 | 0.51 | 0.78 | 1 | ▇▆▆▆▇ |
| x40 | 0 | 1 | 0.48 | 0.29 | 0 | 0.22 | 0.47 | 0.71 | 1 | ▇▇▇▆▆ |
| x41 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.49 | 0.78 | 1 | ▇▆▆▆▇ |
| x42 | 0 | 1 | 0.50 | 0.31 | 0 | 0.22 | 0.50 | 0.78 | 1 | ▇▆▆▆▇ |
| x43 | 0 | 1 | 0.50 | 0.30 | 0 | 0.22 | 0.49 | 0.77 | 1 | ▇▆▆▆▇ |
train_x %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid")) %>%
ggplot(mapping = aes(x = value)) +
geom_boxplot() +
facet_wrap(~name, scales = "free") +
theme_bw() +
theme(axis.text.y = element_blank())
large_train_x_df <- readr::read_csv('train_input_set_x.csv', col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
large_train_x_df %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, ~
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.452833469~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.797549~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0.7~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0.7~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0.1~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.039929672~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.699955~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.707366~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.0~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.4~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0.2~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0.6~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0.3~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0.9~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0.9~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0.3~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.671517414~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0.7~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0.2~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.956529087~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0.1~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0.7~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.788406~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0.5~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.422873248~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0.7~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0.7~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0.1~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0.6~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.262299~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0.5~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0.4~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0.3~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.011386~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.317191375~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0.9~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.613027684~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0.2~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.4~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.063404346~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.758328~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.217674308~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0.5~
large_train_outputs_df <- readr::read_csv('train_outputs.csv', col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
large_train_outputs_df %>% glimpse()
## Rows: 1,899
## Columns: 3
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18~
## $ response <dbl> 0.74119845, 0.33409119, 0.53881939, 0.06878022, 0.46564607, 0~
## $ outcome <chr> "event", "non_event", "event", "event", "non_event", "non_eve~
large_train_x_df %>%
left_join(large_train_outputs_df, by = 'run_id') %>%
pivot_longer(!c('run_id', 'response', 'outcome')) %>%
ggplot(aes(x=value, color = outcome)) +
geom_freqpoly(aes(y = after_stat(density))) +
facet_wrap(~name)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##EXPLORATORY DATA ANALYSIS OF LARGE DATA, V-VARIABLE
train_x <- readr::read_csv("train_input_set_x.csv", col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_v <- readr::read_csv("train_input_set_v.csv", col_names = TRUE)
## Rows: 1899 Columns: 42
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (42): run_id, v01, v02, v03, v04, v05, v06, v07, v08, v09, v10, v11, v12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_outputs <- readr::read_csv("train_outputs.csv", col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
The classification training data set for the “x-variables” is created below. The continuous output, response, is dropped, while the categorical variable outcome is retained with the “x-variable” inputs. The outcome variable is converted to a factor data type so that all students work with the same level (unique category) ordering. The classification training set for the “x-variables” is named ready_x_B and is shown via a glimpse below.
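The level ordering deserves emphasis because many classification tools (caret included) treat the first factor level as the "positive" class when computing metrics. A minimal illustration:

```r
# With levels = c("event", "non_event"), "event" is the first level and
# will be treated as the positive class by most downstream metrics.
outcome_demo <- factor(c("event", "non_event", "event"),
                       levels = c("event", "non_event"))

levels(outcome_demo)      # "event" comes first
as.integer(outcome_demo)  # event -> 1, non_event -> 2
```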
ready_x_B <- train_x %>%
left_join(train_outputs, by = 'run_id') %>%
select(-run_id, -response) %>%
mutate(outcome = factor(outcome, levels = c("event", "non_event")))
ready_x_B %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.45283346~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.79754~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0.~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0.~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0.~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.03992967~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.69995~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.70736~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0.~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0.~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0.~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0.~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0.~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0.~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.67151741~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0.~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0.~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.95652908~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0.~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0.~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.78840~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0.~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.42287324~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0.~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0.~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0.~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0.~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.26229~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0.~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0.~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0.~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.01138~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.31719137~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0.~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.61302768~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0.~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.06340434~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.75832~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.21767430~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0.~
## $ outcome <fct> event, non_event, event, event, non_event, non_event, non_even~
ready_x_B %>%
ggplot(mapping = aes(x = outcome)) +
geom_bar() +
geom_text(stat = 'count',
mapping = aes(label = after_stat(count)),
color = 'red',
nudge_y = 7,
size = 5.5) +
theme_bw()
ready_x_B %>%
ggplot(mapping = aes(x = outcome)) +
geom_bar(mapping = aes(y = after_stat(prop),
group = 1)) +
geom_text(stat = 'count',
mapping = aes(y = after_stat( count / sum(count) ),
label = after_stat( signif(count / sum(count), 3) )),
color = 'red', nudge_y = 0.0275, size = 5.5) +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 27) +
facet_wrap(~ input_id, scales = "free_y") +
theme_bw() +
theme(axis.text.y = element_blank(),
strip.text = element_text(size = 6.5))
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = input_id), color = 'red') +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = interaction(input_id, outcome),
fill = outcome,
color = outcome),
alpha = 0.25) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
The values above appear to concentrate around 0.5, with a few exceptions among the input variables, and the class distributions appear similar across the inputs.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = interaction(input_id, outcome),
fill = outcome,
color = outcome),
alpha = 0.1) +
stat_summary(fun.data = 'mean_se',
fun.args = list(mult = 2),
mapping = aes(group = interaction(input_id, outcome),
color = outcome),
position = position_dodge(0.75)) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = as.factor(input_id), y = value)) +
stat_summary(fun.data = 'mean_se',
fun.args = list(mult = 2),
mapping = aes(group = interaction(input_id, outcome),
color = outcome)) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
The proportion of events and non-events within each input variable is examined next.
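Before interpreting the per-input breakdown, it helps to know the overall class balance. A sketch, assuming a data frame with an `outcome` column like `ready_x_B` (`class_balance()` is a made-up helper name):

```r
library(dplyr)

# Count each outcome level and its share of the rows.
class_balance <- function(df) {
  df %>%
    count(outcome) %>%
    mutate(prop = n / sum(n))
}
```

Running `class_balance(ready_x_B)` reproduces the counts and proportions annotated on the bar charts earlier.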
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = input_id), fill = 'grey') +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = input_id), fill = 'grey') +
facet_wrap(~input_bin, scales = "free_x") +
theme_bw() +
theme(strip.text = element_blank())
The distributions look uniform for the majority of the input variables, though inputs 18 through 23 have values concentrated around a single value.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = interaction(input_id, outcome),
fill = outcome),
alpha = 0.55) +
facet_wrap(~input_bin, scales = "free_x") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank(),
legend.position = "top")
Focusing on the top-right facet above, the class contributions appear similar for input variables 33 through 43.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
filter(input_id < 23 & input_id > 11) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = interaction(input_id, outcome),
fill = outcome),
alpha = 0.55) +
facet_wrap(~input_bin, scales = "free_x") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank(),
legend.position = "top")
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges() +
facet_wrap(~input_bin, scales = "free_y") +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0603
## Picking joint bandwidth of 0.0589
## Picking joint bandwidth of 0.0578
## Picking joint bandwidth of 0.0607
Input variables 18 through 23 appear to have left-skewed distributions.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges(mapping = aes(fill = outcome),
alpha = 0.5) +
facet_wrap(~input_bin, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0671
## Picking joint bandwidth of 0.0677
## Picking joint bandwidth of 0.0665
## Picking joint bandwidth of 0.0698
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
filter(input_id < 23 & input_id > 11) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges(mapping = aes(fill = outcome),
alpha = 0.5) +
facet_wrap(~input_bin, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0677
ready_x_B %>%
ggplot(mapping = aes(x = x12, y = x13)) +
geom_point(alpha = 0.25, size = 3,
mapping = aes(color = outcome)) +
coord_equal() +
ggthemes::scale_color_colorblind() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
The two input variables appear to be uncorrelated, as the outcome color coding looks random. For confirmation, let's check the correlation plots below.
ready_x_B %>%
select(-outcome) %>%
cor() %>%
corrplot::corrplot(method = 'square', type = 'upper')
ready_x_B %>%
select(-outcome) %>%
cor() %>%
corrplot::corrplot(method = 'square', type = 'upper',
order = 'hclust', hclust.method = 'ward.D2')
There is no significant correlation between the input variables above.
ready_x_B %>%
group_by(outcome) %>%
tidyr::nest() %>%
mutate(cor_wf = map(data, corrr::correlate, quiet = TRUE, diagonal = 1),
cor_lf = map(cor_wf, corrr::stretch)) %>%
select(outcome, cor_lf) %>%
tidyr::unnest(cor_lf) %>%
ungroup() %>%
mutate(x_id = stringr::str_extract(x, '\\d+'),
y_id = stringr::str_extract(y, '\\d+')) %>%
mutate(x_id = as.integer(x_id),
y_id = as.integer(y_id)) %>%
ggplot(mapping = aes(x = as.factor(x_id), y = as.factor(y_id))) +
geom_tile(mapping = aes(fill = r), color = 'white') +
coord_equal() +
facet_wrap(~outcome, labeller = "label_both") +
scale_fill_gradient2('corr',
low = 'red', mid='white', high='blue',
midpoint = 0,
limits = c(-1, 1)) +
labs(x='', y = '') +
theme_bw() +
theme(legend.position = "top",
axis.text = element_text(size = 5.5))
#PART B

## Overview
This RMarkdown shows how to download the final project data. It shows how to compile the two input sets with the outputs and define the regression data vs classification data. It also demonstrates how to fit a simple model (with lm()), save that model, and load it back into the work space. You may find these actions helpful as you work through the project.
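The fit, save, and reload cycle described above can be sketched in a few lines. The sketch below uses the built-in mtcars data as a stand-in, and base `saveRDS()`/`readRDS()` (the base-R equivalents of `readr::write_rds()`/`readr::read_rds()`); the file name is illustrative.

```r
# Fit a simple lm(), save the fitted object to disk, and read it back.
# mtcars is a stand-in data set; "demo_model.rds" is an illustrative name.
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)

saveRDS(demo_fit, "demo_model.rds")       # base-R equivalent of readr::write_rds()
reload_fit <- readRDS("demo_model.rds")   # base-R equivalent of readr::read_rds()

# the reloaded model produces identical predictions
all.equal(predict(demo_fit), predict(reload_fit))
```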
This example uses the tidyverse suite of packages.
library(corrplot)
library(caret)
library(yardstick)
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
##
## Attaching package: 'yardstick'
## The following objects are masked from 'package:caret':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:readr':
##
## spec
There are multiple data sets associated with the final project. You will start with a small, simplified design. This will allow you to get practice fitting models, selecting the best model, and making predictions. You will demonstrate selecting optimal input configurations with this simplified design before tackling the more complicated larger problem.
The simplified data set is read in below. It is assumed that this markdown is located in the same directory as the data. If you want to run this markdown yourself, you should download the data sets from Canvas and place them in the same directory as this .Rmd file. It is highly recommended that you work with an RStudio RProject when working on the final project.
df_start <- readr::read_csv('small_train_data.csv', col_names = TRUE)
## Rows: 125 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (6): x07, x09, x10, x11, x21, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#There are 125 observations and 5 inputs, with the target variable response, in the small data set above.
The simplified design consists of 5 inputs, x07, x09, x10, x11, and x21, and one continuous output, response. The input variable names are consistent with the larger data set, which is why their numbering does not start with x01. A glimpse of the data is given below.
df_start %>% glimpse()
## Rows: 125
## Columns: 6
## $ x07 <dbl> 0.11177674, 0.25100882, 0.65946235, 0.94918516, 0.66804925, 0~
## $ x09 <dbl> 0.56766604, 0.04336232, 0.66390665, 0.65207424, 0.78146069, 0~
## $ x10 <dbl> 0.396381547, 0.618260905, 0.048903943, 0.269819648, 0.2235553~
## $ x11 <dbl> 0.58387771, 0.53653216, 0.48976317, 0.30652758, 0.11211528, 0~
## $ x21 <dbl> 0.810788181, 0.736638473, 0.369762516, 0.926768468, 0.7161686~
## $ response <dbl> -0.292184433, 0.693360200, -0.249312839, 0.378085101, 0.74714~
## EXPLORATORY DATA ANALYSIS OF SMALL DATA
#Check for missing values
visdat::vis_miss(df_start)
#There are no missing values in the variables above
visdat::vis_dat(df_start)
#All the data types are numeric, hence a linear regression model is appropriate
df_start %>% purrr::map_dbl(n_distinct)
## x07 x09 x10 x11 x21 response
## 125 125 125 125 125 125
#Each column has 125 unique values, one per observation
df_start %>% select(-response) %>% summary()
## x07 x09 x10 x11
## Min. :0.0007088 Min. :0.007776 Min. :0.005096 Min. :0.001213
## 1st Qu.:0.2510088 1st Qu.:0.248175 1st Qu.:0.251163 1st Qu.:0.250142
## Median :0.5034108 Median :0.502097 Median :0.501573 Median :0.498635
## Mean :0.5003370 Mean :0.499948 Mean :0.499693 Mean :0.500270
## 3rd Qu.:0.7480832 3rd Qu.:0.749657 3rd Qu.:0.746016 3rd Qu.:0.748200
## Max. :0.9972816 Max. :0.994879 Max. :0.993792 Max. :0.999096
## x21
## Min. :0.004908
## 1st Qu.:0.251978
## Median :0.502134
## Mean :0.499890
## 3rd Qu.:0.747096
## Max. :0.993307
#As seen above, the mean and median of the five input distributions are approximately equal, indicating the inputs are roughly symmetric and do not need transformation
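Note that mean ≈ median indicates symmetry rather than normality; a self-contained sketch on simulated uniform draws (mirroring the 0-to-1 input ranges, not the project data) shows a symmetric-but-uniform sample also has matching mean and median:

```r
# A uniform sample is symmetric, so its mean and median nearly coincide,
# even though it is far from Gaussian. Simulated data, not the project inputs.
set.seed(12)
u <- runif(125, min = 0, max = 1)

c(mean = mean(u), median = median(u))   # both near 0.5
quantile(u, probs = c(0, 0.25, 0.5, 0.75, 1))
```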
df_start %>% skimr::skim()
| Name | Piped data |
| Number of rows | 125 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| x07 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| x09 | 0 | 1 | 0.50 | 0.29 | 0.01 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| x10 | 0 | 1 | 0.50 | 0.29 | 0.01 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| x11 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 1.00 | ▇▇▇▇▇ |
| x21 | 0 | 1 | 0.50 | 0.29 | 0.00 | 0.25 | 0.50 | 0.75 | 0.99 | ▇▇▇▇▇ |
| response | 0 | 1 | 0.05 | 0.78 | -1.74 | -0.55 | 0.37 | 0.64 | 1.31 | ▂▃▃▇▅ |
df_start %>% select(-response) %>%
GGally::ggpairs(progress = FALSE, diag = list(continuous = GGally::wrap('barDiag',bins=25))) +
theme_bw()
df_start %>% cor() %>%
corrplot(method="color")
#The figure above indicates there is no significant correlation between the input variables, but the distributions look uniform rather than Gaussian
#Fitting the linear model
set.seed(12)
FitLinearModel01 <- lm(response ~ x07 + x09 + x10 + x11 + x21, data = df_start, model = TRUE)
FitLinearModel01 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel01 %>% readr::write_rds("my_1simple_model.rds")
#All pair-wise interaction with the 5 inputs
set.seed(12)
FitLinearModel02 <- lm(response ~ (x07 + x09 + x10 + x11 + x21)^2, data = df_start, model = TRUE)
FitLinearModel02 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel02 %>% readr::write_rds("my_2simple_model.rds")
#All cubic interaction
set.seed(12)
FitLinearModel03 <- lm(response ~ (x07 + x09 + x10 + x11 + x21)^3, data = df_start, model = TRUE)
FitLinearModel03 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel03 %>% readr::write_rds("my_3simple_model.rds")
set.seed(12)
FitLinearModel04 <- lm(response ~ (x07 + I(x09^2) + I(x10^3) + I(x11^4) + x21) * (x07 + I(x09^2) + I(x10^3) + I(x11^4) + x21), data = df_start, model = TRUE)
FitLinearModel04 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel04 %>% readr::write_rds("my_4simple_model.rds")
set.seed(12)
FitLinearModel05 <- lm(response ~ (x07 + I(x09^2) + I(x10^3) + x11 + I(x21^4)) * (x07 + I(x09^2) + I(x10^3) + x11 + I(x21^4)), data = df_start, model = TRUE)
FitLinearModel05 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel05 %>% readr::write_rds("my_5simple_model.rds")
set.seed(12)
FitLinearModel06 <- lm(response ~ splines::ns(x09 + x11,df=12) * (x07 + I(x21^2) + I(x10^3)+ I(x09^4)), data = df_start, model = TRUE)
FitLinearModel06 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel06 %>% readr::write_rds("my_6simple_model.rds")
h1 = FitLinearModel01 %>% broom::glance() %>% mutate(model = 'FitLinearModel01')
h2 = FitLinearModel02 %>% broom::glance() %>% mutate(model = 'FitLinearModel02')
h3 = FitLinearModel03 %>% broom::glance() %>% mutate(model = 'FitLinearModel03')
h4 = FitLinearModel04 %>% broom::glance() %>% mutate(model = 'FitLinearModel04')
h5 = FitLinearModel05 %>% broom::glance() %>% mutate(model = 'FitLinearModel05')
h6 = FitLinearModel06 %>% broom::glance() %>% mutate(model = 'FitLinearModel06')
Result_output = rbind(h1,h2,h3,h4,h5,h6)
Model_Result = Result_output %>% select(model,r.squared,AIC,BIC)
Model_Result
## # A tibble: 6 x 4
## model r.squared AIC BIC
## <chr> <dbl> <dbl> <dbl>
## 1 FitLinearModel01 0.806 102. 122.
## 2 FitLinearModel02 0.952 -52.0 -3.92
## 3 FitLinearModel03 0.966 -77.6 -1.26
## 4 FitLinearModel04 0.748 154. 202.
## 5 FitLinearModel05 0.942 -29.4 18.6
## 6 FitLinearModel06 0.992 -181. 5.34
model_metric <- function(mod, mod_name)
{
broom::glance(mod) %>%
mutate(model_name = mod_name)
}
Resultmodel <- purrr::map2_dfr(list(FitLinearModel01,FitLinearModel02,FitLinearModel03,FitLinearModel04,FitLinearModel05,FitLinearModel06),
sprintf('mod-%02d', 1:6),model_metric)
Resultmodel %>%
ggplot(mapping = aes(x = model_name, y = r.squared)) +
geom_linerange(mapping = aes(ymin = 0,
ymax = r.squared)) +
geom_point(size = 4.5) +
labs(x = '') +
theme_bw()
Resultmodel %>%
select(model_name, r.squared, AIC, BIC) %>%
pivot_longer(!c("model_name")) %>%
mutate(model_id = stringr::str_extract(model_name, "\\d+")) %>%
ggplot(mapping = aes(x = model_id, y = value)) +
geom_point(size = 3.5) +
facet_wrap(~name, scales = "free_y") +
labs(x = '') +
theme_bw()
set.seed(12)
train_id <- sample(1:nrow(df_start), size = floor(0.6 * nrow(df_start)))
train_ready <- df_start %>% slice(train_id)
test_ready <- df_start %>% slice(-train_id)
fit_and_assess <- function(a_formula, model_name, train_data, test_data, y_name)
{
mod <- lm( a_formula, data = train_data)
pred_train <- as.vector(mod$fitted.values)
y_train <- train_data %>% dplyr::select(all_of(y_name)) %>% pull()
train_metrics <- tibble::tibble(
rmse_value = rmse_vec(y_train, pred_train),
mae_value = mae_vec(y_train, pred_train),
r2_value = rsq_vec(y_train, pred_train)
)
pred_test <- as.vector(predict(mod, newdata = test_data))
y_test <- test_data %>% dplyr::select(all_of(y_name)) %>% pull()
test_metrics <- tibble::tibble(
rmse_value = rmse_vec(y_test, pred_test),
mae_value = mae_vec(y_test, pred_test),
r2_value = rsq_vec(y_test, pred_test)
)
train_metrics %>% mutate(on_set = "train") %>%
bind_rows(test_metrics %>% mutate(on_set = "test")) %>%
mutate(model_name = model_name)
}
set.seed(12)
one_split_results <- purrr::map2_dfr(list(formula(FitLinearModel01), formula(FitLinearModel02), formula(FitLinearModel03),
formula(FitLinearModel04), formula(FitLinearModel05), formula(FitLinearModel06)),
sprintf("mod-%02d", 1:6),
fit_and_assess,
train_data = train_ready,
test_data = test_ready,
y_name = "response")
one_split_results %>%
mutate(model_id = stringr::str_extract(model_name, "\\d+")) %>%
ggplot(mapping = aes(x = model_id, y = rmse_value)) +
geom_line(mapping = aes(color = on_set,
group = on_set),
size = 1.1) +
geom_point(mapping = aes(color = on_set),
size = 2.5) +
scale_color_brewer("", palette = "Set1") +
labs(x = 'model') +
theme_bw()
#As shown above, when fitting on the full training data and judging by R-squared alone, FitLinearModel06 performs best, with an R-squared of 0.99, close to 1. However, considering the other metrics together with a resampling approach, FitLinearModel02 and FitLinearModel03 are the two best models according to AIC, BIC, and RMSE. The Bayesian modelling of these two best models is shown further below.
#Therefore, the best model among the 6 is FitLinearModel02, all pair-wise interactions of the 5 inputs.
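The gap between the R-squared ranking and the AIC/BIC ranking reflects the complexity penalty built into the information criteria. A self-contained toy illustration (simulated data, not the project models) of why the highest R-squared model need not win:

```r
# R-squared never decreases as terms are added to a nested OLS model,
# but AIC charges 2 per extra parameter, so needless terms hurt it.
set.seed(12)
toy <- data.frame(x = runif(60))
toy$y <- 2 * toy$x + rnorm(60, sd = 0.3)

fit_simple  <- lm(y ~ x, data = toy)
fit_complex <- lm(y ~ poly(x, 15), data = toy)

c(r2_simple  = summary(fit_simple)$r.squared,
  r2_complex = summary(fit_complex)$r.squared)   # complex is never lower
c(aic_simple = AIC(fit_simple), aic_complex = AIC(fit_complex))
```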
#The coefficient summaries of the best two models are shown below
FitLinearModel02 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
FitLinearModel03 %>% coefplot::coefplot()
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
## Warning: It is deprecated to specify `guide = FALSE` to remove a guide. Please
## use `guide = "none"` instead.
#The confidence intervals of the FitLinearModel03 coefficients are wider than those of FitLinearModel02, representing more uncertainty in the beta coefficients; the FitLinearModel02 coefficients are correspondingly more certain.
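The visual comparison of coefficient uncertainty can be made numeric with `confint()`. The sketch below compares average 95% interval widths for pairwise versus three-way interaction models on simulated data (the names, data, and `avg_ci_width()` helper are illustrative, not the project objects):

```r
# Higher-order interaction models spread the same information over more
# coefficients, which typically widens each coefficient's interval.
set.seed(12)
sim <- data.frame(x1 = runif(125), x2 = runif(125), x3 = runif(125))
sim$y <- sin(2 * sim$x1) + sim$x2 * sim$x3 + rnorm(125, sd = 0.2)

fit_pair <- lm(y ~ (x1 + x2 + x3)^2, data = sim)
fit_trip <- lm(y ~ (x1 + x2 + x3)^3, data = sim)

# average width of the 95% confidence interval across coefficients
avg_ci_width <- function(fit) mean(apply(confint(fit), 1, diff))
c(pairwise = avg_ci_width(fit_pair), three_way = avg_ci_width(fit_trip))
```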
You will begin the project by fitting linear models to predict the output, response, based on the 5 inputs in the small design of 125 observations.
Let’s fit a simple model to this data set. I do not recommend the following model. It is just to demonstrate fitting a model and the code to save that model.
set.seed(12)
my_ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5, search = "random")
my_metric <- "RMSE"
fitLinear_Model <- caret::train(response ~ (x07 + x09 + x10 + x11 + x21)^3,
data = df_start,
method = 'glmnet',
metric = my_metric,
preProcess = c('center', 'scale'),
trControl = my_ctrl)
print(fitLinear_Model)
## glmnet
##
## 125 samples
## 5 predictor
##
## Pre-processing: centered (25), scaled (25)
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 100, 100, 101, 100, 99, 101, ...
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.1693481 0.316801210 0.2760581 0.9120993 0.2287535
## 0.2693819 0.004890525 0.1881563 0.9455716 0.1485987
## 0.9426217 0.001325400 0.1820536 0.9486266 0.1441553
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.9426217 and lambda
## = 0.0013254.
set.seed(12)
min_lambda <- log(min(fitLinear_Model$results$lambda))
max_lambda <- log(max(fitLinear_Model$results$lambda))
enet_grid <- expand.grid(alpha = seq(0.1, 0.9, length.out = 9),
lambda = exp(seq(min_lambda, max_lambda, length.out = 25)))
enet_grid %>% glimpse()
## Rows: 225
## Columns: 2
## $ alpha <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.1, 0.2, 0.3, 0.4~
## $ lambda <dbl> 0.001325400, 0.001325400, 0.001325400, 0.001325400, 0.001325400~
enet_tune <- caret::train(response ~ (x07 + x09 + x10 + x11 + x21)^3,
                          data = df_start,
                          method = 'glmnet',
                          metric = my_metric,
                          preProcess = c('center', 'scale'),
                          trControl = my_ctrl,
                          tuneGrid = enet_grid)
#print(enet_tune)
ggplot(varImp(enet_tune))
print(enet_tune$bestTune)
## alpha lambda
## 201 0.9 0.0013254
plot(enet_tune, xTrans = log)
coef(enet_tune$finalModel, enet_tune$bestTune$lambda)
## 26 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.047534005
## x07 .
## x09 0.082632186
## x10 0.180095231
## x11 0.128688547
## x21 -0.001558126
## x07:x09 .
## x07:x10 0.003073261
## x07:x11 0.042399214
## x07:x21 -0.097540280
## x09:x10 -0.092111308
## x09:x11 -0.944200410
## x09:x21 .
## x10:x11 -0.213568958
## x10:x21 .
## x11:x21 .
## x07:x09:x10 -0.027782940
## x07:x09:x11 .
## x07:x09:x21 0.141514535
## x07:x10:x11 -0.064729029
## x07:x10:x21 0.069826365
## x07:x11:x21 0.013032821
## x09:x10:x11 0.269472672
## x09:x10:x21 -0.021598713
## x09:x11:x21 -0.076774922
## x10:x11:x21 -0.017254447
plot(caret::varImp(enet_tune))
#Bayesian linear model fitting
baye <- model.matrix( response ~ (x07 + x09 + x10 + x11 + x21)^3, data = df_start)
needed_info <- list(
yobs = df_start$response,
design_matrix = baye,
mu_beta = 0,
tau_beta = 1,
sigma_rate = 1
)
baye1 <- model.matrix( response ~ (x07 + x09 + x10 + x11 + x21)^2, data = df_start)
needed_info1 <- list(
yobs = df_start$response,
design_matrix = baye1,
mu_beta = 0,
tau_beta = 1,
sigma_rate = 1
)
#Defining the logPostFunction
my1_logpost <- function(unknowns, my_info)
{
# specify the number of unknown beta parameters
length_beta <- ncol(my_info$design_matrix)
# extract the beta parameters from the `unknowns` vector
beta_v <- unknowns[1:length_beta ]
# extract the unbounded noise parameter, varphi
lik_varphi <- unknowns[length_beta+1]
# back-transform from varphi to sigma
lik_sigma <- exp(lik_varphi)
# extract design matrix
X <- my_info$design_matrix
# calculate the linear predictor
mu <- as.vector(X %*% as.matrix(beta_v))
# evaluate the log-likelihood
log_lik <- sum(dnorm(x=my_info$yobs, mean = mu, sd=lik_sigma, log= TRUE))
# evaluate the log-prior
log_prior_beta <- sum(dnorm(x=beta_v, mean =my_info$mu_beta, sd=my_info$tau_beta, log= TRUE))
log_prior_sigma <- dexp(x=lik_sigma,rate = my_info$sigma_rate, log = TRUE)
# add the mean trend prior and noise prior together
log_prior <- log_prior_beta + log_prior_sigma
# account for the transformation
log_derive_adjust <- lik_varphi
# sum together
log_lik + log_prior + log_derive_adjust
}
#Executing the Laplace approximation
my_laplace <- function(start_guess, logpost_func, ...)
{
# code adapted from the `LearnBayes` function `laplace()`
fit <- optim(start_guess,
logpost_func,
gr = NULL,
...,
method = "BFGS",
hessian = TRUE,
control = list(fnscale = -1, maxit = 1001))
mode <- fit$par
post_var_matrix <- -solve(fit$hessian)
p <- length(mode)
int <- p/2 * log(2 * pi) + 0.5 * log(det(post_var_matrix)) + logpost_func(mode, ...)
# package all of the results into a list
list(mode = mode,
var_matrix = post_var_matrix,
log_evidence = int,
converge = ifelse(fit$convergence == 0,
"YES",
"NO"),
iter_counts = as.numeric(fit$counts[1]))
}
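As a sanity check of the recipe above, the same `optim()`-based Laplace approximation can be run on a toy one-parameter problem where the answer is known analytically. Everything below (the `toy_logpost` function, the simulated data) is a self-contained sketch, independent of the project functions:

```r
# Laplace approximation for the mean of normal data with a weak prior:
# the posterior mode should land very close to the sample mean, and the
# posterior sd close to 1/sqrt(n).
set.seed(12)
y <- rnorm(50, mean = 2, sd = 1)

toy_logpost <- function(mu, info) {
  sum(dnorm(info$y, mean = mu, sd = 1, log = TRUE)) +   # log-likelihood
    dnorm(mu, mean = 0, sd = 10, log = TRUE)            # weak log-prior
}

fit <- optim(0, toy_logpost, info = list(y = y),
             method = "BFGS", hessian = TRUE,
             control = list(fnscale = -1))

c(post_mode = fit$par, sample_mean = mean(y))   # should nearly coincide
sqrt(-1 / fit$hessian[1, 1])                    # Laplace posterior sd, near 1/sqrt(50)
```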
#Bayesian Fitting of FitLinearModel03
set.seed(12)
laplaceBaye7 <- my_laplace(rep(0, ncol(baye)+1), my1_logpost, needed_info)
laplaceBaye7$converge
## [1] "YES"
laplaceBaye7$mode
## [1] 0.62143452 -0.11894559 0.18206606 0.61142403 0.37144075 -0.06857005
## [7] 0.18588648 0.16409100 0.41021553 -0.52661935 -0.50032685 -3.88083235
## [13] 0.24367395 -0.97981512 0.15112648 0.01526941 -0.18038292 -0.35152620
## [19] 0.94952752 -0.57466339 0.43940346 0.29673833 1.86941827 -0.29285437
## [25] -0.88672895 -0.16062722 -1.87782929
sqrt(diag(laplaceBaye7$var_matrix))
## [1] 0.15807707 0.27661774 0.28281860 0.24064783 0.30485662 0.27957751
## [7] 0.41295358 0.38360122 0.45304671 0.43871521 0.42089642 0.45575611
## [13] 0.44687423 0.42580528 0.41185681 0.37734035 0.47854735 0.47056972
## [19] 0.54324746 0.49453238 0.54234828 0.49565729 0.52304114 0.54192217
## [25] 0.53817062 0.49610358 0.06746234
laplaceBaye7 %>% readr::write_rds("my_7Bsimple_model.rds")
#Bayesian Fitting of FitLinearModel02
set.seed(12)
laplaceBaye8 <- my_laplace(rep(0, ncol(baye1)+1), my1_logpost, needed_info1)
laplaceBaye8$converge
## [1] "YES"
laplaceBaye8$mode
## [1] 0.67087042 -0.14297578 -0.14183418 0.52369789 0.33736122 -0.12311861
## [7] 0.41049157 -0.06481048 -0.07163802 0.23397481 0.13115557 -3.37305985
## [13] 0.15164782 -0.51884727 0.03777686 -0.26445925 -1.75748557
sqrt(diag(laplaceBaye8$var_matrix))
## [1] 0.13986165 0.18815237 0.17998965 0.17970380 0.20235159 0.18737309
## [7] 0.18078688 0.19689709 0.18235542 0.19939621 0.19226149 0.19372592
## [13] 0.21033917 0.18981663 0.19962608 0.19903720 0.06368765
laplaceBaye8 %>% readr::write_rds("my_8Bsimple_model.rds")
viz_post_coefs <- function(post_means, post_sds, xnames)
{
tibble::tibble(
mu = post_means,
sd = post_sds,
x = xnames
) %>%
mutate(x = factor(x, levels = xnames)) %>%
ggplot(mapping = aes(x = x)) +
geom_hline(yintercept = 0, color = 'grey', linetype = 'dashed') +
geom_point(mapping = aes(y = mu)) +
geom_linerange(mapping = aes(ymin = mu - 2 * sd,
ymax = mu + 2 * sd,
group = x)) +
labs(x = 'feature', y = 'coefficient value') +
coord_flip() +
theme_bw()
}
viz_post_coefs(laplaceBaye7$mode[1:(length(laplaceBaye7$mode) - 1)],
               sqrt(diag(laplaceBaye7$var_matrix))[1:(length(laplaceBaye7$mode) - 1)],
               colnames(needed_info$design_matrix))
viz_post_coefs(laplaceBaye8$mode[1:(length(laplaceBaye8$mode) - 1)],
               sqrt(diag(laplaceBaye8$var_matrix))[1:(length(laplaceBaye8$mode) - 1)],
               colnames(needed_info1$design_matrix))
generate_lm_post_samples <- function(mvn_result, length_beta, num_samples)
{
MASS::mvrnorm(n = num_samples,
mu = mvn_result$mode,
Sigma = mvn_result$var_matrix) %>%
as.data.frame() %>% tibble::as_tibble() %>%
purrr::set_names(c(sprintf("beta_%02d", 0:(length_beta-1)), "varphi")) %>%
mutate(sigma = exp(varphi))
}
#The posterior uncertainty in the residual error (sigma) for the best two models is shown below:
set.seed(12)
post_samples_Baye7 <- generate_lm_post_samples(laplaceBaye7, ncol(baye), 2500)
post_samples_Baye8 <- generate_lm_post_samples(laplaceBaye8, ncol(baye1), 2500)
post_samples_Baye7 %>%
ggplot(mapping = aes(x = sigma)) +
geom_histogram(bins = 55) +
theme_bw()
post_samples_Baye8 %>%
ggplot(mapping = aes(x = sigma)) +
geom_histogram(bins = 55) +
theme_bw()
#The posterior sigma from the Bayesian FitLinearModel03 ranges from about 0.13 to 0.18 (a spread of 0.05), while that from FitLinearModel02 ranges from about 0.12 to 0.14 (a spread of 0.02). These findings give credence to the coefficient summaries from the non-Bayesian modelling above.
viz_grid <- expand.grid(x07 = seq(from = 0, to = 1, length.out=6),
x09 = seq(from = 0, to = 1, length.out=50),
x10 = seq(from = 0, to = 1, length.out=6),
x11 = seq(from = 0, to = 1, length.out=6),
x21 = seq(from = 0, to = 1, length.out=6),
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE) %>%
as.data.frame() %>% tibble::as_tibble()
viz_grid
## # A tibble: 64,800 x 5
## x07 x09 x10 x11 x21
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0
## 2 0.2 0 0 0 0
## 3 0.4 0 0 0 0
## 4 0.6 0 0 0 0
## 5 0.8 0 0 0 0
## 6 1 0 0 0 0
## 7 0 0.0204 0 0 0
## 8 0.2 0.0204 0 0 0
## 9 0.4 0.0204 0 0 0
## 10 0.6 0.0204 0 0 0
## # ... with 64,790 more rows
tidy_predict <- function(mod, xnew)
{
pred_df <- predict(mod, xnew, interval = "confidence") %>%
as.data.frame() %>% tibble::as_tibble() %>%
dplyr::select(pred = fit, ci_lwr = lwr, ci_upr = upr) %>%
bind_cols(predict(mod, xnew, interval = 'prediction') %>%
as.data.frame() %>% tibble::as_tibble() %>%
dplyr::select(pred_lwr = lwr, pred_upr = upr))
xnew %>% bind_cols(pred_df)
}
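tidy_predict() stacks confidence and prediction intervals side by side. The self-contained check below (a toy lm() on mtcars, not the project models) confirms the expected relationship that prediction intervals always contain the corresponding confidence intervals:

```r
# Confidence intervals bound the mean trend; prediction intervals also
# include the irreducible noise, so they are always at least as wide.
fit <- lm(mpg ~ wt, data = mtcars)
xnew <- data.frame(wt = seq(1.5, 5.5, length.out = 5))

conf_int <- as.data.frame(predict(fit, xnew, interval = "confidence"))
pred_int <- as.data.frame(predict(fit, xnew, interval = "prediction"))

all(pred_int$lwr <= conf_int$lwr & pred_int$upr >= conf_int$upr)   # TRUE
```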
pred_lm_02 <- tidy_predict(FitLinearModel02,viz_grid)
pred_lm_03 <- tidy_predict(FitLinearModel03,viz_grid)
pred_lm_02 %>%
ggplot(mapping = aes(x = x09)) +
geom_line(mapping = aes(y = pred))+
geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = 'Orange')+
geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = 'Grey') +
coord_cartesian(ylim = c(-7,7)) +
facet_wrap(facets = 'x11')
pred_lm_03 %>%
ggplot(mapping = aes(x = x09)) +
geom_line(mapping = aes(y = pred))+
geom_ribbon(mapping = aes(ymin = pred_lwr, ymax = pred_upr), fill = 'Orange')+
geom_ribbon(mapping = aes(ymin = ci_lwr, ymax = ci_upr), fill = 'Grey') +
coord_cartesian(ylim = c(-7,7)) +
facet_wrap(facets = 'x11')
The predictive trends of the top 2 models differ. As seen earlier in their coefficient summaries, the confidence interval in pred_lm_03, corresponding to FitLinearModel03, is wide enough to overwhelm the prediction interval. In contrast, the confidence interval in pred_lm_02, corresponding to FitLinearModel02, is clearly visible and does not overwhelm the prediction interval, an indication of model consistency and certainty. Input values near 1 correspond to minimizing the continuous output, and the confidence interval becomes unstable at those input values.
laplaceBaye7 <- readr::read_rds("my_7Bsimple_model.rds")
laplaceBaye8 <- readr::read_rds("my_8Bsimple_model.rds")
#PART C

## Load packages
This example uses the tidyverse suite of packages.
library(corrplot)
library(caret)
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.10 v recipes 0.1.16
## v dials 0.0.9 v rsample 0.1.0
## v infer 1.0.0 v tune 0.1.6
## v modeldata 0.1.1 v workflows 0.2.3
## v parsnip 0.1.7 v workflowsets 0.1.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x caret::lift() masks purrr::lift()
## x yardstick::precision() masks caret::precision()
## x yardstick::recall() masks caret::recall()
## x yardstick::sensitivity() masks caret::sensitivity()
## x yardstick::spec() masks readr::spec()
## x yardstick::specificity() masks caret::specificity()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
After you have gone through fitting non-Bayesian and Bayesian linear models with the small simplified design, you will work with the larger more challenging data.
The large data are divided into three data sets because as part of the project you must consider two different sets of input features. You must train models to predict the responses as a function of the first input set, “the x-variables”, then you must train models to predict the responses as a function of the second input set, “the v-variables”. You will identify which input set produces models with better performance, or if the input set ultimately does not matter.
The first input set, “the x-variables” are loaded below.
train_x <- readr::read_csv("train_input_set_x.csv", col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
A glimpse of the “x-variable” training data are shown below. The first column run_id is NOT an input that you should consider. The run_id column is a unique identifier (a key) that uniquely defines each row in the data. There are 43 “x-variable” inputs with names x01 through x43.
train_x %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, ~
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.452833469~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.797549~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0.7~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0.7~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0.1~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.039929672~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.699955~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.707366~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.0~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.4~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0.2~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0.6~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0.3~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0.9~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0.9~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0.3~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.671517414~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0.7~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0.2~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.956529087~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0.1~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0.7~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.788406~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0.5~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.422873248~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0.7~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0.7~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0.1~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0.6~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.262299~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0.5~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0.4~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0.3~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.011386~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.317191375~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0.9~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.613027684~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0.2~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.4~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.063404346~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.758328~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.217674308~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0.5~
The second input set, “the v-variables”, are read in below.
train_v <- readr::read_csv("train_input_set_v.csv", col_names = TRUE)
## Rows: 1899 Columns: 42
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (42): run_id, v01, v02, v03, v04, v05, v06, v07, v08, v09, v10, v11, v12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
The glimpse shown below again reveals that the first column is the identifier run_id. There are 41 “v-variables” with names v01 through v41.
train_v %>% glimpse()
## Rows: 1,899
## Columns: 42
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, ~
## $ v01 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.0~
## $ v02 <dbl> 0.59176441, 0.05360364, 0.78632885, 0.49152356, 0.15821610, 0.0~
## $ v03 <dbl> 0.99866539, 0.74442030, 0.19855145, 0.10180881, 0.16904275, 0.2~
## $ v04 <dbl> 0.51914601, 0.08378037, 0.62221318, 0.49275614, 0.08277755, 0.0~
## $ v05 <dbl> 0.98440052, 0.73504281, 0.27906224, 0.22789122, 0.14089563, 0.3~
## $ v06 <dbl> 0.27761280, 0.17613618, 0.09989673, 0.38340545, 0.20285082, 0.0~
## $ v07 <dbl> 0.72065775, 0.76673777, 0.52718254, 0.68876044, 0.18952144, 0.3~
## $ v08 <dbl> 0.25497542, 0.21544020, 0.20828020, 0.37008161, 0.24295170, 0.0~
## $ v09 <dbl> 0.74392608, 0.79567991, 0.53658142, 0.72114180, 0.17622231, 0.3~
## $ v10 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.0~
## $ v11 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.4~
## $ v12 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.0~
## $ v13 <dbl> 0.974760511, 0.973723716, 0.998414906, 0.698134370, 0.008995403~
## $ v14 <dbl> 0.6033220, 0.3995356, 0.2323312, 0.6210052, 0.1473009, 0.143937~
## $ v15 <dbl> 0.961458306, 0.994421267, 0.994419338, 0.480866206, 0.022627107~
## $ v16 <dbl> 0.5845183, 0.3929352, 0.2359107, 0.5664388, 0.1438076, 0.126914~
## $ v17 <dbl> 0.9913476989, 0.9990456588, 0.9982807708, 0.3484922944, 0.05527~
## $ v18 <dbl> 0.58758891, 0.46163856, 0.46932273, 0.66083964, 0.17190848, 0.1~
## $ v19 <dbl> 0.992575596, 0.998326752, 0.997408082, 0.384090286, 0.070633595~
## $ v20 <dbl> 0.62007956, 0.43412785, 0.39534349, 0.57752233, 0.15817533, 0.2~
## $ v21 <dbl> 0.995455551, 0.998267938, 0.999570601, 0.419018421, 0.040337560~
## $ v22 <dbl> 0.67064607, 0.41399999, 0.37593184, 0.53785166, 0.15708468, 0.2~
## $ v23 <dbl> 0.96558251, 0.99883838, 0.40326323, 0.39321015, 0.06023979, 0.4~
## $ v24 <dbl> 0.8126079, 0.3969788, 0.3570253, 0.6673296, 0.1941783, 0.231481~
## $ v25 <dbl> 0.82547824, 0.99934056, 0.83645894, 0.53876807, 0.02006501, 0.6~
## $ v26 <dbl> 0.7898702, 0.4028064, 0.4261415, 0.7360598, 0.2085493, 0.237749~
## $ v27 <dbl> 0.72778006, 0.99861090, 0.83560109, 0.75246718, 0.02473616, 0.7~
## $ v28 <dbl> 0.7754927, 0.4112850, 0.4358383, 0.7761393, 0.2282164, 0.256831~
## $ v29 <dbl> 0.99418321, 0.98082307, 0.82981961, 0.10035939, 0.01207357, 0.8~
## $ v30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.262299~
## $ v31 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.006289228~
## $ v32 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.006289228~
## $ v33 <dbl> 0.743168039, 0.169432565, 0.990940250, 0.904737500, 0.966889220~
## $ v34 <dbl> 0.35529833, 0.17737909, 0.45506100, 0.49154722, 0.41835025, 0.3~
## $ v35 <dbl> 0.68272602, 0.20804263, 0.98143915, 0.88001101, 0.95389492, 0.1~
## $ v36 <dbl> 0.29133870, 0.15346219, 0.37971022, 0.51743062, 0.55531942, 0.3~
## $ v37 <dbl> 0.601994815, 0.329438589, 0.984413079, 0.862959415, 0.959366789~
## $ v38 <dbl> 0.28843957, 0.15620445, 0.44257632, 0.57036343, 0.59601264, 0.4~
## $ v39 <dbl> 0.572457748, 0.352100439, 0.988011127, 0.861117030, 0.963308649~
## $ v40 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.4~
## $ v41 <dbl> 0.57544185, 0.34392306, 0.98759446, 0.86143126, 0.96424137, 0.2~
The training outputs are read in for you below.
train_outputs <- readr::read_csv("train_outputs.csv", col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
The train_outputs dataframe has 3 columns. The first column is again run_id, the unique identifier for each row. The second column, response, is the continuous output. The third column, outcome, is the discrete binary outcome. The glimpse of train_outputs is shown below.
train_outputs %>% glimpse()
## Rows: 1,899
## Columns: 3
## $ run_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18~
## $ response <dbl> 0.74119845, 0.33409119, 0.53881939, 0.06878022, 0.46564607, 0~
## $ outcome <chr> "event", "non_event", "event", "event", "non_event", "non_eve~
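Since outcome is a binary label, it can be useful to check the class balance before any classification work. A quick count is sketched below (this assumes train_outputs was loaded as above; it is not part of the original analysis):

```r
#Count events vs non-events in the training outputs (sketch)
train_outputs %>%
  count(outcome) %>%
  mutate(proportion = n / sum(n))
```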
The unique identifier, run_id, is included in all three data sets associated with the complete or large problem. The data sets can be joined or merged to create the complete set of inputs and outputs as required.
The complete training set of all “x-variable” inputs and the continuous output is compiled below. After joining the data together, the run_id and outcome columns are removed to provide a data set of just the “x-variable” inputs and response. A glimpse of the joined data set, ready_x_A, is provided below. The continuous output, response, is the last column in the data set.
ready_x_A <- train_x %>%
left_join(train_outputs, by = 'run_id') %>%
select(-run_id, -outcome)
ready_x_A %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.4528334~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.7975~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.0399296~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.6999~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.7073~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.6715174~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.9565290~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.7884~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.4228732~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.2622~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.0113~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.3171913~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.6130276~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.0634043~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.7583~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.2176743~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0~
## $ response <dbl> 0.74119845, 0.33409119, 0.53881939, 0.06878022, 0.46564607, 0~
#7-fold cross-validation with 3 repeats, using random search for tuning parameters
my_ctrl <- trainControl(method = "repeatedcv", number = 7, repeats = 3, search = 'random')
my_metric <- "RMSE"
#Linear additive features
set.seed(12)
fit_lm_9 <- train(response ~ .,
data = ready_x_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_9 %>% readr::write_rds("my_9simple_model.rds")
fit_lm_9
#All pairwise interaction between the inputs
set.seed(12)
fit_lm_10 <- train(response ~ (.)^2,
data = ready_x_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_10 %>% readr::write_rds("my_10simple_model.rds")
fit_lm_10
#Natural spline (12 degrees of freedom) of x09 + x11 interacting with all inputs
set.seed(12)
fit_lm_11 <- train(response ~ splines::ns(x09 + x11,df=12) * (.),
data = ready_x_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_11 %>% readr::write_rds("my_11simple_model.rds")
fit_lm_11
#Regularized regression with elastic-net
#All pairwise interaction between the inputs
set.seed(12)
fit_enet_12 <- train(response ~ (.)^2,
data = ready_x_A,
method = "glmnet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_enet_12 %>% readr::write_rds("my_12simple_model.rds")
fit_enet_12
#All main effect and interaction terms up to three-way (triplet) interactions between the inputs
set.seed(12)
fit_enet_13 <- train(response ~ (.)^3,
data = ready_x_A,
method = "glmnet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_enet_13 %>% readr::write_rds("my_13simple_model.rds")
fit_enet_13
#Neural network with Linear additive features
set.seed(12)
fit_nnet_14 <- train(response ~ .,
data = ready_x_A,
method = "nnet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE,
linout = TRUE)
fit_nnet_14 %>% readr::write_rds("my_14simple_model.rds")
fit_nnet_14
#Random Forest with Linear additive features
set.seed(12)
fit_rf15 <- train(response ~ .,
data = ready_x_A,
method = "rf",
metric = my_metric,
trControl = my_ctrl,
importance = TRUE)
fit_rf15 %>% readr::write_rds("my_15simple_model.rds")
fit_rf15
#Gradient boosted tree with default random search
set.seed(12)
fit_xgb16 <- train(response ~ .,
data = ready_x_A,
method = "xgbTree",
metric = my_metric,
trControl = my_ctrl,
objective = 'reg:squarederror')
fit_xgb16 %>% readr::write_rds("my_16simple_model.rds")
#Refine the gradient boosted tree with a custom tuning grid around the best defaults
xgb_grid <- expand.grid(nrounds = seq(100, 700, by = 100),
max_depth = c(3, 4, 9),
eta = c(0.5*fit_xgb16$bestTune$eta, fit_xgb16$bestTune$eta),
gamma = fit_xgb16$bestTune$gamma,
colsample_bytree = fit_xgb16$bestTune$colsample_bytree,
min_child_weight = fit_xgb16$bestTune$min_child_weight,
subsample = fit_xgb16$bestTune$subsample)
set.seed(12)
fit_xgb16T <- train(response ~ .,
data = ready_x_A,
method = "xgbTree",
tuneGrid = xgb_grid,
metric = my_metric,
trControl = my_ctrl,
objective = 'reg:squarederror')
fit_xgb16T %>% readr::write_rds("my_16Tsimple_model.rds")
fit_xgb16T$bestTune
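To confirm the refined grid actually helped, the default random-search fit and the tuned fit can be compared on their resampling results. A sketch is shown below (it assumes both fits are still in memory):

```r
#Compare default random-search xgboost against the refined grid (sketch)
xgb_compare <- resamples(list(XGB16_default = fit_xgb16,
                              XGB16_tuned = fit_xgb16T))
summary(xgb_compare)$statistics$RMSE
```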
#Support vector machine (radial basis kernel) with all pairwise interactions
set.seed(12)
fit_svm17 <- train(response ~ (.)^2,
data = ready_x_A,
method = "svmRadial",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_svm17 %>% readr::write_rds("my_17simple_model.rds")
fit_svm17
#Neural network with PCA pre-processing of the inputs
set.seed(12)
fit_pca_nnet18 <- train(response ~ .,
data = ready_x_A,
method = "pcaNNet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE,
linout = TRUE)
fit_pca_nnet18 %>% readr::write_rds("my_18simple_model.rds")
fit_pca_nnet18
#Reload the saved model objects
fit_lm_9 <- readr::read_rds("my_9simple_model.rds")
fit_lm_10 <- readr::read_rds("my_10simple_model.rds")
fit_lm_11 <- readr::read_rds("my_11simple_model.rds")
fit_enet_12 <- readr::read_rds("my_12simple_model.rds")
fit_enet_13 <- readr::read_rds("my_13simple_model.rds")
fit_nnet_14 <- readr::read_rds("my_14simple_model.rds")
fit_rf15 <- readr::read_rds("my_15simple_model.rds")
fit_xgb16 <- readr::read_rds("my_16simple_model.rds")
fit_xgb16T <- readr::read_rds("my_16Tsimple_model.rds")
fit_svm17 <- readr::read_rds("my_17simple_model.rds")
fit_pca_nnet18 <- readr::read_rds("my_18simple_model.rds")
my_resultsXRegression <- resamples(list(LM_9 = fit_lm_9,
LM_10 = fit_lm_10,
SPLINE_11 = fit_lm_11,
ENET_12 = fit_enet_12,
ENET_13 = fit_enet_13,
NNET14 = fit_nnet_14,
RF15 = fit_rf15,
XGB16 = fit_xgb16,
XGB16T = fit_xgb16T,
SVM17 = fit_svm17,
PCA_NNET18 = fit_pca_nnet18))
dotplot(my_resultsXRegression)
dotplot(my_resultsXRegression, metric = "RMSE")
dotplot(my_resultsXRegression, metric = "Rsquared")
plot(varImp(fit_nnet_14))
varImp(fit_nnet_14, scale = FALSE)
## nnet variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## x09 8.131
## x11 7.313
## x10 5.580
## x05 3.744
## x08 3.538
## x06 2.943
## x17 2.472
## x36 2.420
## x38 2.315
## x14 2.311
## x19 2.222
## x04 2.192
## x18 2.165
## x43 2.143
## x25 2.138
## x03 2.134
## x07 2.127
## x27 2.109
## x39 2.103
## x41 2.078
plot(varImp(fit_nnet_14), top = 20)
splom(my_resultsXRegression)
densityplot(my_resultsXRegression)
bwplot(my_resultsXRegression, layout = c(3, 1))
#Functions to build the prediction visualization grid
make_variable_sequence <- function(xname, xvalues, primary_vars, secondary_vars)
{
if( xname %in% primary_vars ){
xrange <- range(xvalues)
xvec <- seq(xrange[1], xrange[2], length.out = 51)
} else if ( xname %in% secondary_vars ) {
xrange <- range(xvalues)
xvec <- seq(xrange[1], xrange[2], length.out = 5)
} else {
xvec <- median(xvalues)
}
xvec
}
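A quick sanity check of make_variable_sequence() on toy values illustrates the three branches: primary inputs get a fine 51-point grid, secondary inputs a coarse 5-point grid, and everything else is held at its median. (This check is illustrative only and is not part of the original analysis.)

```r
#Check the three branches of make_variable_sequence() on toy values
set.seed(1)
toy <- runif(100)
length(make_variable_sequence("x09", toy, c("x09", "x11"), c("x10", "x21")))  # 51
length(make_variable_sequence("x10", toy, c("x09", "x11"), c("x10", "x21")))  # 5
length(make_variable_sequence("x01", toy, c("x09", "x11"), c("x10", "x21")))  # 1
```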
make_viz_grid_list <- function(primary_vars, secondary_vars, training_inputs)
{
all_names <- training_inputs %>% names()
xlist <- purrr::map2(all_names,
training_inputs,
make_variable_sequence,
primary_vars = primary_vars,
secondary_vars = secondary_vars)
names(xlist) <- all_names
xlist
}
viz_grid_list <- make_viz_grid_list(primary_vars = c("x11", "x09"),
secondary_vars = c("x10", "x21"),
training_inputs = ready_x_A )
viz_grid_list %>% length()
## [1] 44
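The size of the expanded grid follows directly from the sequence lengths: the two primary inputs contribute 51 values each, the two secondary inputs 5 values each, and every other input a single (median) value.

```r
#Expected grid size: 51 * 51 * 5 * 5 rows
51^2 * 5^2
## [1] 65025
```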
viz_grid_xvariable <- expand.grid(viz_grid_list,
KEEP.OUT.ATTRS = FALSE,
stringsAsFactors = FALSE) %>%
as.data.frame() %>% tibble::as_tibble()
viz_grid_xvariable %>% glimpse()
## Rows: 65,025
## Columns: 44
## $ x01 <dbl> 0.5249783, 0.5249783, 0.5249783, 0.5249783, 0.5249783, 0.5249~
## $ x02 <dbl> 0.4942061, 0.4942061, 0.4942061, 0.4942061, 0.4942061, 0.4942~
## $ x03 <dbl> 0.478837, 0.478837, 0.478837, 0.478837, 0.478837, 0.478837, 0~
## $ x04 <dbl> 0.4810649, 0.4810649, 0.4810649, 0.4810649, 0.4810649, 0.4810~
## $ x05 <dbl> 0.432994, 0.432994, 0.432994, 0.432994, 0.432994, 0.432994, 0~
## $ x06 <dbl> 0.4638712, 0.4638712, 0.4638712, 0.4638712, 0.4638712, 0.4638~
## $ x07 <dbl> 0.4859985, 0.4859985, 0.4859985, 0.4859985, 0.4859985, 0.4859~
## $ x08 <dbl> 0.4404713, 0.4404713, 0.4404713, 0.4404713, 0.4404713, 0.4404~
## $ x09 <dbl> 0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0~
## $ x10 <dbl> 0.0002950899, 0.0002950899, 0.0002950899, 0.0002950899, 0.000~
## $ x11 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ x12 <dbl> 0.4843477, 0.4843477, 0.4843477, 0.4843477, 0.4843477, 0.4843~
## $ x13 <dbl> 0.5138254, 0.5138254, 0.5138254, 0.5138254, 0.5138254, 0.5138~
## $ x14 <dbl> 0.4946071, 0.4946071, 0.4946071, 0.4946071, 0.4946071, 0.4946~
## $ x15 <dbl> 0.5300603, 0.5300603, 0.5300603, 0.5300603, 0.5300603, 0.5300~
## $ x16 <dbl> 0.4948615, 0.4948615, 0.4948615, 0.4948615, 0.4948615, 0.4948~
## $ x17 <dbl> 0.4908604, 0.4908604, 0.4908604, 0.4908604, 0.4908604, 0.4908~
## $ x18 <dbl> 0.6650321, 0.6650321, 0.6650321, 0.6650321, 0.6650321, 0.6650~
## $ x19 <dbl> 0.6507062, 0.6507062, 0.6507062, 0.6507062, 0.6507062, 0.6507~
## $ x20 <dbl> 0.6673075, 0.6673075, 0.6673075, 0.6673075, 0.6673075, 0.6673~
## $ x21 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ x22 <dbl> 0.6758939, 0.6758939, 0.6758939, 0.6758939, 0.6758939, 0.6758~
## $ x23 <dbl> 0.6630225, 0.6630225, 0.6630225, 0.6630225, 0.6630225, 0.6630~
## $ x24 <dbl> 0.4895114, 0.4895114, 0.4895114, 0.4895114, 0.4895114, 0.4895~
## $ x25 <dbl> 0.4953374, 0.4953374, 0.4953374, 0.4953374, 0.4953374, 0.4953~
## $ x26 <dbl> 0.5006957, 0.5006957, 0.5006957, 0.5006957, 0.5006957, 0.5006~
## $ x27 <dbl> 0.5008897, 0.5008897, 0.5008897, 0.5008897, 0.5008897, 0.5008~
## $ x28 <dbl> 0.4870209, 0.4870209, 0.4870209, 0.4870209, 0.4870209, 0.4870~
## $ x29 <dbl> 0.5088498, 0.5088498, 0.5088498, 0.5088498, 0.5088498, 0.5088~
## $ x30 <dbl> 0.6218922, 0.6218922, 0.6218922, 0.6218922, 0.6218922, 0.6218~
## $ x31 <dbl> 0.4857729, 0.4857729, 0.4857729, 0.4857729, 0.4857729, 0.4857~
## $ x32 <dbl> 0.50202, 0.50202, 0.50202, 0.50202, 0.50202, 0.50202, 0.50202~
## $ x33 <dbl> 0.5004106, 0.5004106, 0.5004106, 0.5004106, 0.5004106, 0.5004~
## $ x34 <dbl> 0.487153, 0.487153, 0.487153, 0.487153, 0.487153, 0.487153, 0~
## $ x35 <dbl> 0.5007553, 0.5007553, 0.5007553, 0.5007553, 0.5007553, 0.5007~
## $ x36 <dbl> 0.50077, 0.50077, 0.50077, 0.50077, 0.50077, 0.50077, 0.50077~
## $ x37 <dbl> 0.4872058, 0.4872058, 0.4872058, 0.4872058, 0.4872058, 0.4872~
## $ x38 <dbl> 0.5009047, 0.5009047, 0.5009047, 0.5009047, 0.5009047, 0.5009~
## $ x39 <dbl> 0.5064742, 0.5064742, 0.5064742, 0.5064742, 0.5064742, 0.5064~
## $ x40 <dbl> 0.467539, 0.467539, 0.467539, 0.467539, 0.467539, 0.467539, 0~
## $ x41 <dbl> 0.4882291, 0.4882291, 0.4882291, 0.4882291, 0.4882291, 0.4882~
## $ x42 <dbl> 0.4970046, 0.4970046, 0.4970046, 0.4970046, 0.4970046, 0.4970~
## $ x43 <dbl> 0.4907011, 0.4907011, 0.4907011, 0.4907011, 0.4907011, 0.4907~
## $ response <dbl> 0.2513475, 0.2513475, 0.2513475, 0.2513475, 0.2513475, 0.2513~
viz_grid_xvariable %>%
mutate(pred = predict(fit_nnet_14, newdata = viz_grid_xvariable)) %>%
ggplot(mapping = aes(x = x09, y = response)) +
geom_point(alpha = 0.3, size = 1.85,
mapping = aes(color = x10)) +
geom_line(mapping = aes(y = pred,
group = interaction(x21,
x07,
x10),
color = x10),
size = 1.) +
facet_grid(x07 ~ x21, labeller = "label_both") +
scale_color_viridis_c(option = 'plasma') +
theme_bw()
# Prediction based on PCA_NNET, second best model of the x-variables
viz_grid_xvariable %>%
mutate(pred = predict(fit_pca_nnet18, newdata = viz_grid_xvariable)) %>%
ggplot(mapping = aes(x = x09, y = response)) +
geom_point(alpha = 0.3, size = 1.85,
mapping = aes(color = x10)) +
geom_line(mapping = aes(y = pred,
group = interaction(x21,
x07,
x10),
color = x10),
size = 1.) +
facet_grid(x07 ~ x21, labeller = "label_both") +
scale_color_viridis_c(option = 'plasma') +
theme_bw()
The complete training set of all “v-variable” inputs and the continuous response is created below. The steps are the same as those used to create the complete “x-variable” training set. The finalized ready training set is named ready_v_A and is shown as a glimpse below.
ready_v_A <- train_v %>%
left_join(train_outputs, by = 'run_id') %>%
select(-run_id, -outcome)
ready_v_A %>% glimpse()
## Rows: 1,899
## Columns: 42
## $ v01 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0~
## $ v02 <dbl> 0.59176441, 0.05360364, 0.78632885, 0.49152356, 0.15821610, 0~
## $ v03 <dbl> 0.99866539, 0.74442030, 0.19855145, 0.10180881, 0.16904275, 0~
## $ v04 <dbl> 0.51914601, 0.08378037, 0.62221318, 0.49275614, 0.08277755, 0~
## $ v05 <dbl> 0.98440052, 0.73504281, 0.27906224, 0.22789122, 0.14089563, 0~
## $ v06 <dbl> 0.27761280, 0.17613618, 0.09989673, 0.38340545, 0.20285082, 0~
## $ v07 <dbl> 0.72065775, 0.76673777, 0.52718254, 0.68876044, 0.18952144, 0~
## $ v08 <dbl> 0.25497542, 0.21544020, 0.20828020, 0.37008161, 0.24295170, 0~
## $ v09 <dbl> 0.74392608, 0.79567991, 0.53658142, 0.72114180, 0.17622231, 0~
## $ v10 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0~
## $ v11 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0~
## $ v12 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0~
## $ v13 <dbl> 0.974760511, 0.973723716, 0.998414906, 0.698134370, 0.0089954~
## $ v14 <dbl> 0.6033220, 0.3995356, 0.2323312, 0.6210052, 0.1473009, 0.1439~
## $ v15 <dbl> 0.961458306, 0.994421267, 0.994419338, 0.480866206, 0.0226271~
## $ v16 <dbl> 0.5845183, 0.3929352, 0.2359107, 0.5664388, 0.1438076, 0.1269~
## $ v17 <dbl> 0.9913476989, 0.9990456588, 0.9982807708, 0.3484922944, 0.055~
## $ v18 <dbl> 0.58758891, 0.46163856, 0.46932273, 0.66083964, 0.17190848, 0~
## $ v19 <dbl> 0.992575596, 0.998326752, 0.997408082, 0.384090286, 0.0706335~
## $ v20 <dbl> 0.62007956, 0.43412785, 0.39534349, 0.57752233, 0.15817533, 0~
## $ v21 <dbl> 0.995455551, 0.998267938, 0.999570601, 0.419018421, 0.0403375~
## $ v22 <dbl> 0.67064607, 0.41399999, 0.37593184, 0.53785166, 0.15708468, 0~
## $ v23 <dbl> 0.96558251, 0.99883838, 0.40326323, 0.39321015, 0.06023979, 0~
## $ v24 <dbl> 0.8126079, 0.3969788, 0.3570253, 0.6673296, 0.1941783, 0.2314~
## $ v25 <dbl> 0.82547824, 0.99934056, 0.83645894, 0.53876807, 0.02006501, 0~
## $ v26 <dbl> 0.7898702, 0.4028064, 0.4261415, 0.7360598, 0.2085493, 0.2377~
## $ v27 <dbl> 0.72778006, 0.99861090, 0.83560109, 0.75246718, 0.02473616, 0~
## $ v28 <dbl> 0.7754927, 0.4112850, 0.4358383, 0.7761393, 0.2282164, 0.2568~
## $ v29 <dbl> 0.99418321, 0.98082307, 0.82981961, 0.10035939, 0.01207357, 0~
## $ v30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.2622~
## $ v31 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.0062892~
## $ v32 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.0062892~
## $ v33 <dbl> 0.743168039, 0.169432565, 0.990940250, 0.904737500, 0.9668892~
## $ v34 <dbl> 0.35529833, 0.17737909, 0.45506100, 0.49154722, 0.41835025, 0~
## $ v35 <dbl> 0.68272602, 0.20804263, 0.98143915, 0.88001101, 0.95389492, 0~
## $ v36 <dbl> 0.29133870, 0.15346219, 0.37971022, 0.51743062, 0.55531942, 0~
## $ v37 <dbl> 0.601994815, 0.329438589, 0.984413079, 0.862959415, 0.9593667~
## $ v38 <dbl> 0.28843957, 0.15620445, 0.44257632, 0.57036343, 0.59601264, 0~
## $ v39 <dbl> 0.572457748, 0.352100439, 0.988011127, 0.861117030, 0.9633086~
## $ v40 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0~
## $ v41 <dbl> 0.57544185, 0.34392306, 0.98759446, 0.86143126, 0.96424137, 0~
## $ response <dbl> 0.74119845, 0.33409119, 0.53881939, 0.06878022, 0.46564607, 0~
#Linear additive features
set.seed(12)
fit_lm_19 <- train(response ~ .,
data = ready_v_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_19 %>% readr::write_rds("my_19simple_model.rds")
fit_lm_19
#All pairwise interaction between the inputs
set.seed(12)
fit_lm_20 <- train(response ~ (.)^2,
data = ready_v_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_20 %>% readr::write_rds("my_20simple_model.rds")
fit_lm_20
#Natural spline (12 degrees of freedom) of v09 + v11 interacting with all inputs
set.seed(12)
fit_lm_21 <- train(response ~ splines::ns(v09 + v11,df=12) * (.),
data = ready_v_A,
method = "lm",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_lm_21 %>% readr::write_rds("my_21simple_model.rds")
fit_lm_21
#Regularized regression with elastic-net
#All pairwise interaction between the inputs
set.seed(12)
fit_enet_22 <- train(response ~ (.)^2,
data = ready_v_A,
method = "glmnet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_enet_22 %>% readr::write_rds("my_22simple_model.rds")
fit_enet_22
#All main effect and interaction terms up to three-way (triplet) interactions between the inputs
set.seed(12)
fit_enet_23 <- train(response ~ (.)^3,
data = ready_v_A,
method = "glmnet",
metric = my_metric,
preProcess = c("center", "scale"),
trControl = my_ctrl)
fit_enet_23 %>% readr::write_rds("my_23simple_model.rds")
fit_enet_23
#Neural network with linear additive features and a custom tuning grid
set.seed(12)
nnet_grid <- expand.grid(size = c(2, 4, 6, 8, 10, 12),
decay = exp(seq(-6, 2, length.out = 13)))
fit_nnet_24 <- train(response ~ .,
data = ready_v_A,
method = "nnet",
metric = my_metric,
tuneGrid = nnet_grid,
preProcess = c("center", "scale"),
trControl = my_ctrl,
trace = FALSE,
linout = TRUE)
fit_nnet_24 %>% readr::write_rds("my_24simple_model.rds")
fit_nnet_24
## Neural Network
##
## 1899 samples
## 41 predictor
##
## Pre-processing: centered (41), scaled (41)
## Resampling: Cross-Validated (7 fold, repeated 3 times)
## Summary of sample sizes: 1628, 1629, 1628, 1627, 1627, 1628, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 2 0.002478752 0.2183021 0.9217591 0.16461894
## 2 0.004827950 0.2120612 0.9264193 0.16016575
## 2 0.009403563 0.2144420 0.9249324 0.16332991
## 2 0.018315639 0.2085081 0.9287949 0.15771362
## 2 0.035673993 0.2073471 0.9293807 0.15873648
## 2 0.069483451 0.2066683 0.9300389 0.15802769
## 2 0.135335283 0.2056978 0.9310689 0.15648112
## 2 0.263597138 0.2034561 0.9323375 0.15484913
## 2 0.513417119 0.2022340 0.9332932 0.15432418
## 2 1.000000000 0.1992516 0.9352798 0.15234648
## 2 1.947734041 0.2046498 0.9318459 0.15610235
## 2 3.793667895 0.2093019 0.9290428 0.15985182
## 2 7.389056099 0.2196997 0.9228461 0.16776746
## 4 0.002478752 0.1877140 0.9419484 0.14303344
## 4 0.004827950 0.1804571 0.9464967 0.13634376
## 4 0.009403563 0.1804371 0.9463473 0.13632166
## 4 0.018315639 0.1801642 0.9467283 0.13709230
## 4 0.035673993 0.1783776 0.9478639 0.13410004
## 4 0.069483451 0.1734031 0.9504504 0.13085488
## 4 0.135335283 0.1735921 0.9504883 0.13081816
## 4 0.263597138 0.1690029 0.9527332 0.12745675
## 4 0.513417119 0.1735550 0.9506286 0.13015478
## 4 1.000000000 0.1776509 0.9483221 0.13405771
## 4 1.947734041 0.1822762 0.9458002 0.13676602
## 4 3.793667895 0.1979876 0.9361982 0.14967591
## 4 7.389056099 0.2160365 0.9246856 0.16366040
## 6 0.002478752 0.1572564 0.9591960 0.11907190
## 6 0.004827950 0.1541224 0.9607429 0.11574521
## 6 0.009403563 0.1584357 0.9586251 0.11911783
## 6 0.018315639 0.1529263 0.9615388 0.11434005
## 6 0.035673993 0.1468409 0.9645440 0.11034804
## 6 0.069483451 0.1465877 0.9646515 0.11027595
## 6 0.135335283 0.1444745 0.9658948 0.10792149
## 6 0.263597138 0.1493839 0.9635072 0.11160029
## 6 0.513417119 0.1583319 0.9587137 0.11829032
## 6 1.000000000 0.1660397 0.9550948 0.12469152
## 6 1.947734041 0.1721908 0.9518311 0.12891040
## 6 3.793667895 0.1934335 0.9391764 0.14546159
## 6 7.389056099 0.2155641 0.9248874 0.16279829
## 8 0.002478752 0.1500979 0.9624139 0.11330253
## 8 0.004827950 0.1417324 0.9668810 0.10670659
## 8 0.009403563 0.1450002 0.9651837 0.10896998
## 8 0.018315639 0.1345355 0.9701933 0.10115008
## 8 0.035673993 0.1346878 0.9703607 0.10148332
## 8 0.069483451 0.1310705 0.9718895 0.09815738
## 8 0.135335283 0.1333787 0.9708640 0.10009044
## 8 0.263597138 0.1391425 0.9683256 0.10408685
## 8 0.513417119 0.1501438 0.9632593 0.11223104
## 8 1.000000000 0.1573692 0.9597035 0.11793340
## 8 1.947734041 0.1729733 0.9512873 0.12942785
## 8 3.793667895 0.1888239 0.9420412 0.14151514
## 8 7.389056099 0.2142386 0.9258091 0.16187704
## 10 0.002478752 0.1390620 0.9682634 0.10518996
## 10 0.004827950 0.1311846 0.9717564 0.09901568
## 10 0.009403563 0.1260419 0.9740796 0.09515603
## 10 0.018315639 0.1275409 0.9734200 0.09710165
## 10 0.035673993 0.1296757 0.9724253 0.09680241
## 10 0.069483451 0.1239652 0.9749523 0.09287485
## 10 0.135335283 0.1260727 0.9740466 0.09413894
## 10 0.263597138 0.1342406 0.9705195 0.09992434
## 10 0.513417119 0.1412299 0.9675062 0.10527159
## 10 1.000000000 0.1573273 0.9597130 0.11744658
## 10 1.947734041 0.1702938 0.9527724 0.12712282
## 10 3.793667895 0.1901100 0.9413918 0.14234027
## 10 7.389056099 0.2147673 0.9254158 0.16222784
## 12 0.002478752 0.1251634 0.9744266 0.09423156
## 12 0.004827950 0.1322189 0.9710452 0.10015280
## 12 0.009403563 0.1250933 0.9744240 0.09431803
## 12 0.018315639 0.1244636 0.9746044 0.09317015
## 12 0.035673993 0.1211176 0.9760159 0.09149935
## 12 0.069483451 0.1205875 0.9760815 0.09015097
## 12 0.135335283 0.1249446 0.9744290 0.09273896
## 12 0.263597138 0.1281303 0.9732016 0.09616729
## 12 0.513417119 0.1400936 0.9679813 0.10472527
## 12 1.000000000 0.1529953 0.9618798 0.11483507
## 12 1.947734041 0.1669006 0.9546560 0.12480514
## 12 3.793667895 0.1864319 0.9436152 0.13986062
## 12 7.389056099 0.2148241 0.9253288 0.16219212
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 12 and decay = 0.06948345.
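The tuning summary above can also be queried programmatically rather than read off the printed table. A minimal sketch, assuming a caret `train()` object such as `fit_nnet_24` from the earlier chunk:

```r
# Inspect a caret tuning run programmatically
# (assumes a train() object such as fit_nnet_24 from the chunk above).
fit_nnet_24$bestTune        # the winning (size, decay) combination
fit_nnet_24$results %>%     # cross-validated metrics for every grid point
  dplyr::arrange(RMSE) %>%
  head()
```

Sorting `$results` by RMSE confirms the combination reported by the `train()` printout.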
# Random forest with linear additive features
set.seed(12)
fit_rf25 <- train(response ~ .,
                  data = ready_v_A,
                  method = "rf",
                  metric = my_metric,
                  trControl = my_ctrl,
                  importance = TRUE)
fit_rf25 %>% readr::write_rds("my_25simple_model.rds")
fit_rf25
# Gradient boosted tree
set.seed(12)
fit_xgb26 <- train(response ~ .,
                   data = ready_v_A,
                   method = "xgbTree",
                   metric = my_metric,
                   trControl = my_ctrl,
                   objective = 'reg:squarederror')
fit_xgb26 %>% readr::write_rds("my_26simple_model.rds")
fit_xgb26$bestTune
# Refined tuning grid around the best settings found above
xgb_grid <- expand.grid(nrounds = seq(100, 700, by = 100),
                        max_depth = c(3, 4, 9),
                        eta = c(0.5*fit_xgb26$bestTune$eta, fit_xgb26$bestTune$eta),
                        gamma = fit_xgb26$bestTune$gamma,
                        colsample_bytree = fit_xgb26$bestTune$colsample_bytree,
                        min_child_weight = fit_xgb26$bestTune$min_child_weight,
                        subsample = fit_xgb26$bestTune$subsample)
set.seed(12)
fit_xgb26T <- train(response ~ .,
                    data = ready_v_A,
                    method = "xgbTree",
                    tuneGrid = xgb_grid,
                    metric = my_metric,
                    trControl = my_ctrl,
                    objective = 'reg:squarederror')
fit_xgb26T %>% readr::write_rds("my_26Tsimple_model.rds")
fit_xgb26T$bestTune
# Support vector machine with radial basis kernel
set.seed(12)
fit_svm27 <- train(response ~ .,
                   data = ready_v_A,
                   method = "svmRadial",
                   metric = my_metric,
                   preProcess = c("center", "scale"),
                   trControl = my_ctrl)
fit_svm27 %>% readr::write_rds("my_27simple_model.rds")
fit_svm27
# Neural network with PCA-preprocessed inputs
set.seed(12)
fit_pca_nnet28 <- train(response ~ .,
                        data = ready_v_A,
                        method = "pcaNNet",
                        metric = my_metric,
                        preProcess = c("center", "scale"),
                        trControl = my_ctrl,
                        trace = FALSE,
                        linout = TRUE)
fit_pca_nnet28 %>% readr::write_rds("my_28simple_model.rds")
fit_pca_nnet28
fit_lm_19 <- readr::read_rds("my_19simple_model.rds")
fit_lm_20 <- readr::read_rds("my_20simple_model.rds")
fit_lm_21 <- readr::read_rds("my_21simple_model.rds")
fit_enet_22 <- readr::read_rds("my_22simple_model.rds")
fit_enet_23 <- readr::read_rds("my_23simple_model.rds")
fit_nnet_24 <- readr::read_rds("my_24simple_model.rds")
fit_rf25 <- readr::read_rds("my_25simple_model.rds")
fit_xgb26 <- readr::read_rds("my_26simple_model.rds")
fit_xgb26T <- readr::read_rds("my_26Tsimple_model.rds")
fit_svm27 <- readr::read_rds("my_27simple_model.rds")
fit_pca_nnet28 <- readr::read_rds("my_28simple_model.rds")
my_resultsVRegression <- resamples(list(LM_19 = fit_lm_19,
                                        LM_20 = fit_lm_20,
                                        SPLINE_21 = fit_lm_21,
                                        ENET_22 = fit_enet_22,
                                        ENET_23 = fit_enet_23,
                                        NNET24 = fit_nnet_24,
                                        RF25 = fit_rf25,
                                        XGB26 = fit_xgb26,
                                        XGB26T = fit_xgb26T,
                                        SVM27 = fit_svm27,
                                        PCA_NNET28 = fit_pca_nnet28))
dotplot(my_resultsVRegression)
dotplot(my_resultsVRegression, metric = "RMSE")
dotplot(my_resultsVRegression, metric = "Rsquared")
plot(varImp(fit_nnet_24))
varImp(fit_nnet_24, scale = FALSE)
## nnet variable importance
##
## only 20 most important variables shown (out of 41)
##
## Overall
## v06 6.518
## v10 6.486
## v12 6.431
## v04 6.085
## v08 5.864
## v07 3.829
## v09 3.745
## v11 3.164
## v02 2.804
## v33 2.621
## v37 2.589
## v18 2.428
## v41 2.414
## v29 2.270
## v35 2.233
## v39 2.229
## v14 2.176
## v05 2.048
## v19 2.044
## v40 1.978
plot(varImp(fit_nnet_24), top = 20)
splom(my_resultsVRegression)
densityplot(my_resultsVRegression)
bwplot(my_resultsVRegression, layout = c(3, 1))
viz_grid_list1 <- make_viz_grid_list(primary_vars = c("v04", "v07"),
                                     secondary_vars = c("v02", "v35"),
                                     training_inputs = ready_v_A)
viz_grid_list1 %>% length()
## [1] 42
viz_grid_Vvariable <- expand.grid(viz_grid_list1,
                                  KEEP.OUT.ATTRS = FALSE,
                                  stringsAsFactors = FALSE) %>%
  as.data.frame() %>% tibble::as_tibble()
viz_grid_Vvariable %>% glimpse()
## Rows: 65,025
## Columns: 42
## $ v01 <dbl> 0.5034804, 0.5034804, 0.5034804, 0.5034804, 0.5034804, 0.5034~
## $ v02 <dbl> 0.0000000, 0.2498867, 0.4997734, 0.7496601, 0.9995468, 0.0000~
## $ v03 <dbl> 0.3007111, 0.3007111, 0.3007111, 0.3007111, 0.3007111, 0.3007~
## $ v04 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.02, 0.02, 0.02, 0.02, 0.02, 0~
## $ v05 <dbl> 0.3098033, 0.3098033, 0.3098033, 0.3098033, 0.3098033, 0.3098~
## $ v06 <dbl> 0.2857139, 0.2857139, 0.2857139, 0.2857139, 0.2857139, 0.2857~
## $ v07 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ v08 <dbl> 0.3017134, 0.3017134, 0.3017134, 0.3017134, 0.3017134, 0.3017~
## $ v09 <dbl> 0.4323631, 0.4323631, 0.4323631, 0.4323631, 0.4323631, 0.4323~
## $ v10 <dbl> 0.3154867, 0.3154867, 0.3154867, 0.3154867, 0.3154867, 0.3154~
## $ v11 <dbl> 0.5004416, 0.5004416, 0.5004416, 0.5004416, 0.5004416, 0.5004~
## $ v12 <dbl> 0.3154867, 0.3154867, 0.3154867, 0.3154867, 0.3154867, 0.3154~
## $ v13 <dbl> 0.5012841, 0.5012841, 0.5012841, 0.5012841, 0.5012841, 0.5012~
## $ v14 <dbl> 0.3160702, 0.3160702, 0.3160702, 0.3160702, 0.3160702, 0.3160~
## $ v15 <dbl> 0.5016033, 0.5016033, 0.5016033, 0.5016033, 0.5016033, 0.5016~
## $ v16 <dbl> 0.3370787, 0.3370787, 0.3370787, 0.3370787, 0.3370787, 0.3370~
## $ v17 <dbl> 0.4943832, 0.4943832, 0.4943832, 0.4943832, 0.4943832, 0.4943~
## $ v18 <dbl> 0.4275649, 0.4275649, 0.4275649, 0.4275649, 0.4275649, 0.4275~
## $ v19 <dbl> 0.4878706, 0.4878706, 0.4878706, 0.4878706, 0.4878706, 0.4878~
## $ v20 <dbl> 0.4633789, 0.4633789, 0.4633789, 0.4633789, 0.4633789, 0.4633~
## $ v21 <dbl> 0.5021127, 0.5021127, 0.5021127, 0.5021127, 0.5021127, 0.5021~
## $ v22 <dbl> 0.4751361, 0.4751361, 0.4751361, 0.4751361, 0.4751361, 0.4751~
## $ v23 <dbl> 0.5232065, 0.5232065, 0.5232065, 0.5232065, 0.5232065, 0.5232~
## $ v24 <dbl> 0.5392861, 0.5392861, 0.5392861, 0.5392861, 0.5392861, 0.5392~
## $ v25 <dbl> 0.5135054, 0.5135054, 0.5135054, 0.5135054, 0.5135054, 0.5135~
## $ v26 <dbl> 0.5681738, 0.5681738, 0.5681738, 0.5681738, 0.5681738, 0.5681~
## $ v27 <dbl> 0.4975432, 0.4975432, 0.4975432, 0.4975432, 0.4975432, 0.4975~
## $ v28 <dbl> 0.5718795, 0.5718795, 0.5718795, 0.5718795, 0.5718795, 0.5718~
## $ v29 <dbl> 0.5071041, 0.5071041, 0.5071041, 0.5071041, 0.5071041, 0.5071~
## $ v30 <dbl> 0.6218922, 0.6218922, 0.6218922, 0.6218922, 0.6218922, 0.6218~
## $ v31 <dbl> 0.5027553, 0.5027553, 0.5027553, 0.5027553, 0.5027553, 0.5027~
## $ v32 <dbl> 0.5027553, 0.5027553, 0.5027553, 0.5027553, 0.5027553, 0.5027~
## $ v33 <dbl> 0.5015588, 0.5015588, 0.5015588, 0.5015588, 0.5015588, 0.5015~
## $ v34 <dbl> 0.3643507, 0.3643507, 0.3643507, 0.3643507, 0.3643507, 0.3643~
## $ v35 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ v36 <dbl> 0.3980347, 0.3980347, 0.3980347, 0.3980347, 0.3980347, 0.3980~
## $ v37 <dbl> 0.5006957, 0.5006957, 0.5006957, 0.5006957, 0.5006957, 0.5006~
## $ v38 <dbl> 0.4631827, 0.4631827, 0.4631827, 0.4631827, 0.4631827, 0.4631~
## $ v39 <dbl> 0.5005019, 0.5005019, 0.5005019, 0.5005019, 0.5005019, 0.5005~
## $ v40 <dbl> 0.5064742, 0.5064742, 0.5064742, 0.5064742, 0.5064742, 0.5064~
## $ v41 <dbl> 0.5003714, 0.5003714, 0.5003714, 0.5003714, 0.5003714, 0.5003~
## $ response <dbl> 0.2513475, 0.2513475, 0.2513475, 0.2513475, 0.2513475, 0.2513~
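The 65,025 rows come from `expand.grid()` forming the full cross product of the list elements, so the row count is the product of the number of unique values of each element. A self-contained toy illustration:

```r
# Toy illustration of how expand.grid() determines the row count:
# the result is the full cross product, so rows = product of element lengths.
toy_grid <- expand.grid(a = 1:5,
                        b = c(0.1, 0.2, 0.3),
                        c = c("low", "high"),
                        KEEP.OUT.ATTRS = FALSE,
                        stringsAsFactors = FALSE)
nrow(toy_grid)   # 5 * 3 * 2 = 30
```

Elements held at a single reference value (like the non-primary inputs above) contribute a factor of 1 and so do not inflate the grid.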
viz_grid_Vvariable %>%
  mutate(pred = predict(fit_nnet_24, newdata = viz_grid_Vvariable)) %>%
  ggplot(mapping = aes(x = v04, y = response)) +
  geom_point(alpha = 0.3, size = 1.85,
             mapping = aes(color = v02)) +
  geom_line(mapping = aes(y = pred,
                          group = interaction(v35,
                                              v12,
                                              v02),
                          color = v02),
            size = 1) +
  facet_grid(v34 ~ v35, labeller = "label_both") +
  scale_color_viridis_c(option = 'plasma') +
  theme_bw()
viz_grid_Vvariable %>%
  mutate(pred = predict(fit_pca_nnet28, newdata = viz_grid_Vvariable)) %>%
  ggplot(mapping = aes(x = v04, y = response)) +
  geom_point(alpha = 0.3, size = 1.85,
             mapping = aes(color = v02)) +
  geom_line(mapping = aes(y = pred,
                          group = interaction(v35,
                                              v12,
                                              v02),
                          color = v02),
            size = 1) +
  facet_grid(v34 ~ v35, labeller = "label_both") +
  scale_color_viridis_c(option = 'plasma') +
  theme_bw()
# PART D: CLASSIFICATION PROBLEM
This example uses the tidyverse suite of packages.
library(corrplot)
library(caret)
library(yardstick)
library(ggridges)
library(tidymodels)
library(RSNNS)
## Warning: package 'RSNNS' was built under R version 4.1.2
## Loading required package: Rcpp
##
## Attaching package: 'Rcpp'
## The following object is masked from 'package:rsample':
##
## populate
##
## Attaching package: 'RSNNS'
## The following object is masked from 'package:parsnip':
##
## mlp
## The following objects are masked from 'package:caret':
##
## confusionMatrix, train
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
train_x <- readr::read_csv("train_input_set_x.csv", col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_v <- readr::read_csv("train_input_set_v.csv", col_names = TRUE)
## Rows: 1899 Columns: 42
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (42): run_id, v01, v02, v03, v04, v05, v06, v07, v08, v09, v10, v11, v12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_outputs <- readr::read_csv("train_outputs.csv", col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
The classification training data set for the “x-variables” is created below. The continuous output, response, is now dropped, while the categorical variable outcome is retained with the “x-variable” inputs. The outcome variable is converted to a factor data type so that all students work with the same level (unique category) ordering. The classification training set for the “x-variables” is named ready_x_B and is shown via a glimpse below.
ready_x_B <- train_x %>%
  left_join(train_outputs, by = 'run_id') %>%
  select(-run_id, -response) %>%
  mutate(outcome = factor(outcome, levels = c("event", "non_event")))
ready_x_B %>% glimpse()
## Rows: 1,899
## Columns: 44
## $ x01 <dbl> 0.985235945, 0.766539206, 0.230160730, 0.004366989, 0.45283346~
## $ x02 <dbl> 0.8133165, 0.4090398, 0.1348584, 0.2153328, 0.4287669, 0.79754~
## $ x03 <dbl> 0.85431883, 0.59294184, 0.94329236, 0.74947991, 0.09339929, 0.~
## $ x04 <dbl> 0.40689188, 0.65407912, 0.59624660, 0.86524873, 0.02715081, 0.~
## $ x05 <dbl> 0.78843409, 0.29460688, 0.88023410, 0.46210092, 0.29478054, 0.~
## $ x06 <dbl> 0.568118821, 0.560441274, 0.416568244, 0.704653992, 0.03992967~
## $ x07 <dbl> 0.2861404, 0.9422601, 0.9588571, 0.3345177, 0.7958649, 0.69995~
## $ x08 <dbl> 0.3879258, 0.3147583, 0.9187132, 0.4178475, 0.3061616, 0.70736~
## $ x09 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.~
## $ x10 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.~
## $ x11 <dbl> 0.23910422, 0.77241094, 0.33454312, 0.41804550, 0.51938061, 0.~
## $ x12 <dbl> 0.95115757, 0.64939311, 0.03927392, 0.81421180, 0.50300581, 0.~
## $ x13 <dbl> 0.74115583, 0.52151850, 0.01774069, 0.43711821, 0.39755305, 0.~
## $ x14 <dbl> 0.29442855, 0.77183170, 0.42802137, 0.05115383, 0.26156947, 0.~
## $ x15 <dbl> 0.54888654, 0.76238208, 0.96098264, 0.20053861, 0.40990326, 0.~
## $ x16 <dbl> 0.04309077, 0.31195836, 0.94128034, 0.66557043, 0.30765023, 0.~
## $ x17 <dbl> 0.607216306, 0.690781613, 0.297202582, 0.705472640, 0.67151741~
## $ x18 <dbl> 0.38739725, 0.91016650, 0.63643358, 0.98006090, 0.73070809, 0.~
## $ x19 <dbl> 0.57354291, 0.32313514, 0.28422620, 0.41521691, 0.60079277, 0.~
## $ x20 <dbl> 0.340625897, 0.676583878, 0.094755464, 0.775107494, 0.95652908~
## $ x21 <dbl> 0.81360343, 0.64061909, 0.12519194, 0.92751984, 0.96419879, 0.~
## $ x22 <dbl> 0.26383148, 0.63779331, 0.55384338, 0.06816080, 0.08793851, 0.~
## $ x23 <dbl> 0.3842269, 0.3365798, 0.6029769, 0.1849245, 0.4525291, 0.78840~
## $ x24 <dbl> 0.73778515, 0.96692256, 0.67231493, 0.28378874, 0.76293502, 0.~
## $ x25 <dbl> 0.981282768, 0.108827549, 0.490767370, 0.222698257, 0.42287324~
## $ x26 <dbl> 0.84256615, 0.83927545, 0.97892574, 0.58282249, 0.08529926, 0.~
## $ x27 <dbl> 0.94379012, 0.95240429, 0.65735120, 0.34888112, 0.06593873, 0.~
## $ x28 <dbl> 0.45138135, 0.86558933, 0.78571934, 0.59198586, 0.26801323, 0.~
## $ x29 <dbl> 0.13943478, 0.47302080, 0.89679514, 0.72276574, 0.86769061, 0.~
## $ x30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.26229~
## $ x31 <dbl> 0.95251121, 0.65470056, 0.95660857, 0.66864426, 0.36616840, 0.~
## $ x32 <dbl> 0.04086708, 0.21303410, 0.66890361, 0.43711845, 0.31129828, 0.~
## $ x33 <dbl> 0.68512056, 0.77238560, 0.05876265, 0.15547193, 0.26599198, 0.~
## $ x34 <dbl> 0.6188148, 0.1416823, 0.5332164, 0.2219962, 0.8350435, 0.01138~
## $ x35 <dbl> 0.703602753, 0.231184484, 0.309159203, 0.294377175, 0.31719137~
## $ x36 <dbl> 0.34883336, 0.13826244, 0.19476157, 0.03123180, 0.82988622, 0.~
## $ x37 <dbl> 0.067194494, 0.146470386, 0.711926648, 0.427824194, 0.61302768~
## $ x38 <dbl> 0.65748519, 0.26970652, 0.97198304, 0.81151578, 0.91305515, 0.~
## $ x39 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.~
## $ x40 <dbl> 0.005055854, 0.164237960, 0.347327840, 0.190936463, 0.06340434~
## $ x41 <dbl> 0.5064933, 0.8071431, 0.9126245, 0.5700994, 0.8945523, 0.75832~
## $ x42 <dbl> 0.578485226, 0.921327641, 0.799359676, 0.223727471, 0.21767430~
## $ x43 <dbl> 0.12858286, 0.70429182, 0.61064123, 0.66999715, 0.71883998, 0.~
## $ outcome <fct> event, non_event, event, event, non_event, non_event, non_even~
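Because caret's `twoClassSummary` treats the first factor level as the class of interest when computing ROC, sensitivity, and specificity, the explicit level ordering set above matters. A quick check on the training set:

```r
# Verify the factor level ordering: "event" should be the first level,
# since caret treats the first level as the class of interest.
levels(ready_x_B$outcome)
table(ready_x_B$outcome)
```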
ready_x_B %>%
ggplot(mapping = aes(x = outcome)) +
geom_bar() +
geom_text(stat = 'count',
mapping = aes(label = stat(count)),
color = 'red',
nudge_y = 7,
size = 5.5) +
theme_bw()
ready_x_B %>%
ggplot(mapping = aes(x = outcome)) +
geom_bar(mapping = aes(y = stat(prop),
group = 1)) +
geom_text(stat = 'count',
mapping = aes(y = after_stat( count / sum(count) ),
label = after_stat( signif(count / sum(count), 3) )),
color = 'red', nudge_y = 0.0275, size = 5.5) +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = value)) +
geom_histogram(bins = 27) +
facet_wrap(~ input_id, scales = "free_y") +
theme_bw() +
theme(axis.text.y = element_blank(),
strip.text = element_text(size = 6.5))
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = input_id), color = 'red') +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = interaction(input_id, outcome),
fill = outcome,
color = outcome),
alpha = 0.25) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
The values above appear to concentrate around 0.5, with a few exceptions among the input variables, and the class labels appear to be evenly distributed across the input variables.
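That visual impression can be checked numerically. A rough sketch (assuming `ready_x_B` from above) ranking the inputs by how far their medians sit from 0.5:

```r
# Numeric check of the visual impression: per-input medians and IQRs,
# sorted by how far each median sits from 0.5 (a rough sketch).
ready_x_B %>%
  pivot_longer(!outcome) %>%
  group_by(name) %>%
  summarise(median = median(value),
            iqr = IQR(value),
            .groups = "drop") %>%
  arrange(desc(abs(median - 0.5)))
```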
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_boxplot(mapping = aes(group = interaction(input_id, outcome),
fill = outcome,
color = outcome),
alpha = 0.1) +
stat_summary(fun.data = 'mean_se',
fun.args = list(mult = 2),
mapping = aes(group = interaction(input_id, outcome),
color = outcome),
position = position_dodge(0.75)) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = as.factor(input_id), y = value)) +
stat_summary(fun.data = 'mean_se',
fun.args = list(mult = 2),
mapping = aes(group = interaction(input_id, outcome),
color = outcome)) +
ggthemes::scale_color_colorblind() +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(legend.position = "top")
Next, examine how the values are distributed for each input variable, and how the event and non-event classes contribute to those distributions.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = input_id), fill = 'grey') +
theme_bw()
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = input_id), fill = 'grey') +
facet_wrap(~input_bin, scales = "free_x") +
theme_bw() +
theme(strip.text = element_blank())
The distributions look uniform for the majority of the input variables, while inputs 18 through 23 have values concentrated near a single value.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = interaction(input_id, outcome),
fill = outcome),
alpha = 0.55) +
facet_wrap(~input_bin, scales = "free_x") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank(),
legend.position = "top")
Focusing on the top-right facet above, the class contributions appear similar for input variables 33 through 43.
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
filter(input_id < 23 & input_id > 11) %>%
ggplot(mapping = aes(x = input_id, y = value)) +
geom_violin(mapping = aes(group = interaction(input_id, outcome),
fill = outcome),
alpha = 0.55) +
facet_wrap(~input_bin, scales = "free_x") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank(),
legend.position = "top")
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges() +
facet_wrap(~input_bin, scales = "free_y") +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0603
## Picking joint bandwidth of 0.0589
## Picking joint bandwidth of 0.0578
## Picking joint bandwidth of 0.0607
Input variables 18 through 23 appear to have left-skewed distributions.
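The skewness impression can be quantified with the moment coefficient, g1 = mean((x - mean(x))^3) / sd(x)^3, where negative values indicate left skew. A rough sketch for inputs 18 through 23 (assuming `ready_x_B` from above):

```r
# Moment coefficient of skewness; negative values indicate left skew.
# A rough sketch to back up the density-ridge impression.
moment_skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

ready_x_B %>%
  pivot_longer(!outcome) %>%
  mutate(input_id = as.integer(stringr::str_extract(name, "\\d+"))) %>%
  filter(input_id >= 18, input_id <= 23) %>%
  group_by(input_id) %>%
  summarise(skew = moment_skew(value), .groups = "drop")
```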
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges(mapping = aes(fill = outcome),
alpha = 0.5) +
facet_wrap(~input_bin, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0671
## Picking joint bandwidth of 0.0677
## Picking joint bandwidth of 0.0665
## Picking joint bandwidth of 0.0698
ready_x_B %>%
tibble::rowid_to_column() %>%
pivot_longer(!c("rowid", "outcome")) %>%
mutate(input_id = as.integer( stringr::str_extract(name, "\\d+") )) %>%
mutate(input_bin = cut(input_id,
breaks = quantile(input_id),
include.lowest = TRUE)) %>%
filter(input_id < 23 & input_id > 11) %>%
ggplot(mapping = aes(x = value, y = as.factor(input_id))) +
geom_density_ridges(mapping = aes(fill = outcome),
alpha = 0.5) +
facet_wrap(~input_bin, scales = "free_y") +
ggthemes::scale_fill_colorblind() +
theme_bw() +
theme(strip.text = element_blank())
## Picking joint bandwidth of 0.0677
ready_x_B %>%
ggplot(mapping = aes(x = x12, y = x13)) +
geom_point(alpha = 0.25, size = 3) +
coord_equal() +
theme_bw()
ready_x_B %>%
ggplot(mapping = aes(x = x12, y = x13)) +
geom_point(alpha = 0.25, size = 3,
mapping = aes(color = outcome)) +
coord_equal() +
ggthemes::scale_color_colorblind() +
theme_bw() +
guides(color = guide_legend(override.aes = list(alpha = 1.0)))
The two input variables appear to be uncorrelated, and the outcome color coding looks random. For confirmation, the correlation plots are checked below.
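Before the full correlation plots, the x12 / x13 relationship can be checked directly with a single coefficient (assuming `ready_x_B` from above):

```r
# Direct numeric check of the x12 / x13 relationship: the Pearson
# correlation and its significance test.
cor(ready_x_B$x12, ready_x_B$x13)
cor.test(ready_x_B$x12, ready_x_B$x13)
```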
ready_x_B %>%
select(-outcome) %>%
cor() %>%
corrplot::corrplot(method = 'square', type = 'upper')
ready_x_B %>%
select(-outcome) %>%
cor() %>%
corrplot::corrplot(method = 'square', type = 'upper',
order = 'hclust', hclust.method = 'ward.D2')
ready_x_B %>%
group_by(outcome) %>%
tidyr::nest() %>%
mutate(cor_wf = map(data, corrr::correlate, quiet = TRUE, diagonal = 1),
cor_lf = map(cor_wf, corrr::stretch)) %>%
select(outcome, cor_lf) %>%
tidyr::unnest(cor_lf) %>%
ungroup() %>%
mutate(x_id = stringr::str_extract(x, '\\d+'),
y_id = stringr::str_extract(y, '\\d+')) %>%
mutate(x_id = as.integer(x_id),
y_id = as.integer(y_id)) %>%
ggplot(mapping = aes(x = as.factor(x_id), y = as.factor(y_id))) +
geom_tile(mapping = aes(fill = r), color = 'white') +
coord_equal() +
facet_wrap(~outcome, labeller = "label_both") +
scale_fill_gradient2('corr',
low = 'red', mid='white', high='blue',
midpoint = 0,
limits = c(-1, 1)) +
labs(x='', y = '') +
theme_bw() +
theme(legend.position = "top",
axis.text = element_text(size = 5.5))
set.seed(12)
cv_folds <- trainControl(method = "repeatedcv",
                         number = 3,
                         repeats = 9,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary,
                         search = 'random')
my_metric <- "ROC"
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30)*50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
nrow(gbmGrid)
# Linear additive features with boosted logistic regression (LogitBoost)
set.seed(12)
FitLinearModel29 <- caret::train(outcome ~ .,
                                 data = ready_x_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel29 %>% readr::write_rds("my_29simple_model.rds")
# All pair-wise interaction features with boosted logistic regression
# (caret::train is called explicitly because RSNNS masks caret's train)
set.seed(12)
FitLinearModel30 <- caret::train(outcome ~ (.)^2,
                                 data = ready_x_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel30 %>% readr::write_rds("my_30simple_model.rds")
# Natural spline (12 degrees of freedom) of x09 + x11 interacted with all inputs
set.seed(12)
FitLinearModel31 <- caret::train(outcome ~ splines::ns(x09 + x11, df = 12) * (.),
                                 data = ready_x_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel31 %>% readr::write_rds("my_31simple_model.rds")
# All pair-wise interaction features with elastic net regularized logistic regression
set.seed(12)
fit_enet_32 <- caret::train(outcome ~ (.)^2,
                            data = ready_x_B,
                            method = "glmnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds)
fit_enet_32 %>% readr::write_rds("my_32simple_model.rds")
# All interactions up to three-way with elastic net regularized logistic regression
set.seed(12)
fit_enet_33 <- caret::train(outcome ~ (.)^3,
                            data = ready_x_B,
                            method = "glmnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds)
fit_enet_33 %>% readr::write_rds("my_33simple_model.rds")
# Random forest
set.seed(12)
fit_rf34 <- caret::train(outcome ~ .,
                         data = ready_x_B,
                         method = "rf",
                         metric = my_metric,
                         preProcess = c("center", "scale"),
                         trControl = cv_folds)
fit_rf34 %>% readr::write_rds("my_34simple_model.rds")
fit_rf34
# Support vector machine with radial basis kernel
set.seed(12)
fit_svm35 <- caret::train(outcome ~ .,
                          data = ready_x_B,
                          method = "svmRadial",
                          metric = my_metric,
                          preProcess = c("center", "scale"),
                          trControl = cv_folds)
fit_svm35 %>% readr::write_rds("my_35simple_model.rds")
# Neural network
set.seed(12)
fit_nnet_36 <- caret::train(outcome ~ .,
                            data = ready_x_B,
                            method = "nnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds,
                            trace = FALSE,
                            linout = FALSE)
fit_nnet_36 %>% readr::write_rds("my_36simple_model.rds")
fit_nnet_36
# Gradient boosted tree
set.seed(12)
fit_xgb37 <- caret::train(outcome ~ .,
                          data = ready_x_B,
                          method = "xgbTree",
                          metric = my_metric,
                          trControl = cv_folds,
                          objective = "binary:logistic")
fit_xgb37 %>% readr::write_rds("my_37simple_model.rds")
# Bonus points: refined tuning grid around the best xgboost settings
xgb_grid <- expand.grid(nrounds = seq(100, 700, by = 100),
                        max_depth = c(3, 4, 9),
                        eta = c(0.5*fit_xgb37$bestTune$eta, fit_xgb37$bestTune$eta),
                        gamma = fit_xgb37$bestTune$gamma,
                        colsample_bytree = fit_xgb37$bestTune$colsample_bytree,
                        min_child_weight = fit_xgb37$bestTune$min_child_weight,
                        subsample = fit_xgb37$bestTune$subsample)
set.seed(12)
fit_xgb37T <- caret::train(outcome ~ .,
                           data = ready_x_B,
                           method = "xgbTree",
                           tuneGrid = xgb_grid,
                           metric = my_metric,
                           trControl = cv_folds,
                           objective = "binary:logistic")
fit_xgb37T %>% readr::write_rds("my_37Tsimple_model.rds")
# Neural network with PCA-preprocessed inputs
set.seed(12)
fit_pca_nnet38 <- caret::train(outcome ~ .,
                               data = ready_x_B,
                               method = "pcaNNet",
                               metric = my_metric,
                               preProcess = c("center", "scale"),
                               trControl = cv_folds,
                               trace = FALSE,
                               linout = FALSE)
fit_pca_nnet38 %>% readr::write_rds("my_38simple_model.rds")
fit_pca_nnet38
# Reload the saved models
FitLinearModel29 <- readr::read_rds("my_29simple_model.rds")
FitLinearModel30 <- readr::read_rds("my_30simple_model.rds")
FitLinearModel31 <- readr::read_rds("my_31simple_model.rds")
fit_enet_32 <- readr::read_rds("my_32simple_model.rds")
fit_enet_33 <- readr::read_rds("my_33simple_model.rds")
fit_rf34 <- readr::read_rds("my_34simple_model.rds")
fit_svm35 <- readr::read_rds("my_35simple_model.rds")
fit_nnet_36 <- readr::read_rds("my_36simple_model.rds")
fit_xgb37 <- readr::read_rds("my_37simple_model.rds")
fit_xgb37T <- readr::read_rds("my_37Tsimple_model.rds")
fit_pca_nnet38 <- readr::read_rds("my_38simple_model.rds")
my_resultsXClassification <- resamples(list(LM_29 = FitLinearModel29,
                                            LM_30 = FitLinearModel30,
                                            SPLINE_31 = FitLinearModel31,
                                            ENET_32 = fit_enet_32,
                                            ENET_33 = fit_enet_33,
                                            RF34 = fit_rf34,
                                            SVM35 = fit_svm35,
                                            NNET_36 = fit_nnet_36,
                                            XGB37 = fit_xgb37,
                                            XGB37T = fit_xgb37T,
                                            PCA_NNET38 = fit_pca_nnet38))
dotplot(my_resultsXClassification)
varImp(fit_xgb37T, scale = FALSE)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 43)
##
## Overall
## x09 0.563966
## x11 0.155449
## x05 0.047381
## x38 0.022131
## x39 0.021663
## x10 0.020143
## x41 0.018242
## x06 0.018130
## x08 0.015887
## x35 0.008165
## x12 0.007958
## x27 0.007330
## x01 0.006727
## x32 0.005843
## x37 0.005027
## x07 0.004716
## x16 0.004691
## x29 0.004517
## x20 0.004310
## x02 0.004157
plot(varImp(fit_xgb37T), top = 20)
splom(my_resultsXClassification)
densityplot(my_resultsXClassification)
bwplot(my_resultsXClassification, layout = c(3, 1))
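As a tabular complement to the plots, `summary()` on the resamples object reports the distribution of each metric per model. A sketch (assuming `my_resultsXClassification` from above):

```r
# Tabular complement to the resampling plots: mean and quartiles of
# ROC / Sens / Spec for every model.
summary(my_resultsXClassification)

# Mean cross-validated ROC per model, sorted from best to worst.
my_resultsXClassification$values %>%
  select(ends_with("~ROC")) %>%
  colMeans() %>%
  sort(decreasing = TRUE)
```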
Explain_VariablesX <- lime::lime(ready_x_B, fit_xgb37T, n_bins = 5)
Explanation of all x-input features using the first 50 observations, for the event class.
X_VariablesExplain <- lime::explain(
  x = ready_x_B[1:50, ],
  explainer = Explain_VariablesX,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 43,
  feature_select = "highest_weights",
  labels = "event"
)
lime::plot_features(X_VariablesExplain)
lime::plot_explanations(X_VariablesExplain)
Predictive visualization and explanation of all x-input features using the first 50 observations, for the non-event class.
X_VariablesExplain_NonEvent <- lime::explain(
  x = ready_x_B[1:50, ],
  explainer = Explain_VariablesX,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 43,
  feature_select = "highest_weights",
  labels = "non_event"
)
lime::plot_features(X_VariablesExplain_NonEvent)
lime::plot_explanations(X_VariablesExplain_NonEvent)
Lastly, the classification data set associated with the “v-variables” is created below. The data set is named ready_v_B, and a glimpse is shown, which again confirms that the output is the last column in the data set. Model building and prediction visualization are then performed; XGBoost was the best tuned model according to the cross-validated classification metrics among the candidate models.
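Once the best model is identified, hard class predictions and class probabilities can be pulled from it with `predict()`. A rough sketch, assuming `fit_xgb37T` is the selected model (`caret::confusionMatrix` is called explicitly because RSNNS masks it):

```r
# Sketch of obtaining hard and probabilistic predictions from the
# selected model (assumes fit_xgb37T was identified as the best model).
pred_class <- predict(fit_xgb37T, newdata = ready_x_B)
pred_prob  <- predict(fit_xgb37T, newdata = ready_x_B, type = "prob")

# Training-set confusion matrix: optimistic, since the model saw this
# data; the resampled metrics above are the honest performance estimates.
caret::confusionMatrix(pred_class, ready_x_B$outcome)
head(pred_prob)
```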
ready_v_B <- train_v %>%
  left_join(train_outputs, by = 'run_id') %>%
  select(-run_id, -response) %>%
  mutate(outcome = factor(outcome, levels = c("event", "non_event")))
ready_v_B %>% glimpse()
## Rows: 1,899
## Columns: 42
## $ v01 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.~
## $ v02 <dbl> 0.59176441, 0.05360364, 0.78632885, 0.49152356, 0.15821610, 0.~
## $ v03 <dbl> 0.99866539, 0.74442030, 0.19855145, 0.10180881, 0.16904275, 0.~
## $ v04 <dbl> 0.51914601, 0.08378037, 0.62221318, 0.49275614, 0.08277755, 0.~
## $ v05 <dbl> 0.98440052, 0.73504281, 0.27906224, 0.22789122, 0.14089563, 0.~
## $ v06 <dbl> 0.27761280, 0.17613618, 0.09989673, 0.38340545, 0.20285082, 0.~
## $ v07 <dbl> 0.72065775, 0.76673777, 0.52718254, 0.68876044, 0.18952144, 0.~
## $ v08 <dbl> 0.25497542, 0.21544020, 0.20828020, 0.37008161, 0.24295170, 0.~
## $ v09 <dbl> 0.74392608, 0.79567991, 0.53658142, 0.72114180, 0.17622231, 0.~
## $ v10 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.~
## $ v11 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.~
## $ v12 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.~
## $ v13 <dbl> 0.974760511, 0.973723716, 0.998414906, 0.698134370, 0.00899540~
## $ v14 <dbl> 0.6033220, 0.3995356, 0.2323312, 0.6210052, 0.1473009, 0.14393~
## $ v15 <dbl> 0.961458306, 0.994421267, 0.994419338, 0.480866206, 0.02262710~
## $ v16 <dbl> 0.5845183, 0.3929352, 0.2359107, 0.5664388, 0.1438076, 0.12691~
## $ v17 <dbl> 0.9913476989, 0.9990456588, 0.9982807708, 0.3484922944, 0.0552~
## $ v18 <dbl> 0.58758891, 0.46163856, 0.46932273, 0.66083964, 0.17190848, 0.~
## $ v19 <dbl> 0.992575596, 0.998326752, 0.997408082, 0.384090286, 0.07063359~
## $ v20 <dbl> 0.62007956, 0.43412785, 0.39534349, 0.57752233, 0.15817533, 0.~
## $ v21 <dbl> 0.995455551, 0.998267938, 0.999570601, 0.419018421, 0.04033756~
## $ v22 <dbl> 0.67064607, 0.41399999, 0.37593184, 0.53785166, 0.15708468, 0.~
## $ v23 <dbl> 0.96558251, 0.99883838, 0.40326323, 0.39321015, 0.06023979, 0.~
## $ v24 <dbl> 0.8126079, 0.3969788, 0.3570253, 0.6673296, 0.1941783, 0.23148~
## $ v25 <dbl> 0.82547824, 0.99934056, 0.83645894, 0.53876807, 0.02006501, 0.~
## $ v26 <dbl> 0.7898702, 0.4028064, 0.4261415, 0.7360598, 0.2085493, 0.23774~
## $ v27 <dbl> 0.72778006, 0.99861090, 0.83560109, 0.75246718, 0.02473616, 0.~
## $ v28 <dbl> 0.7754927, 0.4112850, 0.4358383, 0.7761393, 0.2282164, 0.25683~
## $ v29 <dbl> 0.99418321, 0.98082307, 0.82981961, 0.10035939, 0.01207357, 0.~
## $ v30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.26229~
## $ v31 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.00628922~
## $ v32 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.00628922~
## $ v33 <dbl> 0.743168039, 0.169432565, 0.990940250, 0.904737500, 0.96688922~
## $ v34 <dbl> 0.35529833, 0.17737909, 0.45506100, 0.49154722, 0.41835025, 0.~
## $ v35 <dbl> 0.68272602, 0.20804263, 0.98143915, 0.88001101, 0.95389492, 0.~
## $ v36 <dbl> 0.29133870, 0.15346219, 0.37971022, 0.51743062, 0.55531942, 0.~
## $ v37 <dbl> 0.601994815, 0.329438589, 0.984413079, 0.862959415, 0.95936678~
## $ v38 <dbl> 0.28843957, 0.15620445, 0.44257632, 0.57036343, 0.59601264, 0.~
## $ v39 <dbl> 0.572457748, 0.352100439, 0.988011127, 0.861117030, 0.96330864~
## $ v40 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.~
## $ v41 <dbl> 0.57544185, 0.34392306, 0.98759446, 0.86143126, 0.96424137, 0.~
## $ outcome <fct> event, non_event, event, event, non_event, non_event, non_even~
set.seed(12)
library(caret)
cv_folds <- trainControl(method = "repeatedcv", number = 3, repeats = 9,
                         classProbs = TRUE, summaryFunction = twoClassSummary,
                         search = 'random')
my_metric <- "ROC"
gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)
nrow(gbmGrid)
## [1] 90
#Linear additive features with boosted logistic regression (LogitBoost)
set.seed(12)
FitLinearModel39 <- caret::train(outcome ~ .,
                                 data = ready_v_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel39 %>% readr::write_rds("my_39simple_model.rds")
#All pair-wise interaction features with boosted logistic regression (LogitBoost)
set.seed(12)
FitLinearModel40 <- caret::train(outcome ~ (.)^2,
                                 data = ready_v_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel40 %>% readr::write_rds("my_40simple_model.rds")
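As a small base-R sketch (toy data frame, not the project data), the `(.)^2` formula used above expands to all main effects plus every pairwise interaction:

```r
# Toy data frame with three inputs; (.)^2 in a formula crosses every
# pair of columns, so the design matrix gains a:b, a:c, and b:c terms.
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)
colnames(model.matrix(~ (.)^2, data = d))
```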
#Natural spline (12 degrees of freedom) of v04 + v06 interacted with all inputs
set.seed(12)
FitLinearModel41 <- caret::train(outcome ~ splines::ns(v04 + v06, df = 12) * (.),
                                 data = ready_v_B,
                                 method = "LogitBoost",
                                 metric = my_metric,
                                 preProcess = c("center", "scale"),
                                 trControl = cv_folds)
FitLinearModel41 %>% readr::write_rds("my_41simple_model.rds")
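The spline term above comes from the base `splines` package; a minimal sketch (toy sequence, not the project inputs) showing that `ns()` with `df = 12` produces a 12-column basis, which the formula then crosses with every input:

```r
library(splines)  # ships with base R
# ns() returns one basis column per degree of freedom, so
# ns(v04 + v06, df = 12) contributes 12 features before the interaction.
x <- seq(0, 1, length.out = 100)
basis <- ns(x, df = 12)
dim(basis)  # 100 rows, 12 columns
```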
#All pair-wise interaction features with elastic net regularized logistic regression
set.seed(12)
fit_enet_42 <- caret::train(outcome ~ (.)^2,
                            data = ready_v_B,
                            method = "glmnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds)
fit_enet_42 %>% readr::write_rds("my_42simple_model.rds")
#Interaction features up to third order with elastic net regularized logistic regression
set.seed(12)
fit_enet_43 <- caret::train(outcome ~ (.)^3,
                            data = ready_v_B,
                            method = "glmnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds)
fit_enet_43 %>% readr::write_rds("my_43simple_model.rds")
#Random forest
set.seed(12)
fit_rf44 <- caret::train(outcome ~ .,
                         data = ready_v_B,
                         method = "rf",
                         metric = my_metric,
                         preProcess = c("center", "scale"),
                         trControl = cv_folds)
fit_rf44 %>% readr::write_rds("my_44simple_model.rds")
fit_rf44
#Support vector machine with radial basis kernel
set.seed(12)
fit_svm45 <- caret::train(outcome ~ .,
                          data = ready_v_B,
                          method = "svmRadial",
                          metric = my_metric,
                          preProcess = c("center", "scale"),
                          trControl = cv_folds)
fit_svm45 %>% readr::write_rds("my_45simple_model.rds")
#Single-hidden-layer neural network
set.seed(12)
fit_nnet_46 <- caret::train(outcome ~ .,
                            data = ready_v_B,
                            method = "nnet",
                            metric = my_metric,
                            preProcess = c("center", "scale"),
                            trControl = cv_folds,
                            trace = FALSE,
                            linout = FALSE)
fit_nnet_46 %>% readr::write_rds("my_46simple_model.rds")
fit_nnet_46
#Gradient boosted trees (XGBoost) with caret's default tuning
set.seed(12)
fit_xgb47 <- caret::train(outcome ~ .,
                          data = ready_v_B,
                          method = "xgbTree",
                          metric = my_metric,
                          trControl = cv_folds,
                          objective = "reg:logistic")
fit_xgb47 %>% readr::write_rds("my_47simple_model.rds")
#Bonus points: custom tuning grid centered on the default search's best settings
set.seed(12)
xgb_grid <- expand.grid(nrounds = seq(100, 800, by = 100),
                        max_depth = c(3, 9, 12),
                        eta = c(0.5 * fit_xgb47$bestTune$eta, fit_xgb47$bestTune$eta),
                        gamma = fit_xgb47$bestTune$gamma,
                        colsample_bytree = fit_xgb47$bestTune$colsample_bytree,
                        min_child_weight = fit_xgb47$bestTune$min_child_weight,
                        subsample = fit_xgb47$bestTune$subsample)
fit_xgb47T <- caret::train(outcome ~ .,
                           data = ready_v_B,
                           method = "xgbTree",
                           tuneGrid = xgb_grid,
                           metric = my_metric,
                           trControl = cv_folds,
                           objective = "reg:logistic")
fit_xgb47T %>% readr::write_rds("my_47Tsimple_model.rds")
#Neural network on principal component scores
set.seed(12)
fit_pca_nnet48 <- caret::train(outcome ~ .,
                               data = ready_v_B,
                               method = "pcaNNet",
                               metric = my_metric,
                               preProcess = c("center", "scale"),
                               trControl = cv_folds,
                               trace = FALSE,
                               linout = FALSE)
fit_pca_nnet48 %>% readr::write_rds("my_48simple_model.rds")
fit_pca_nnet48
#Reload the saved models from disk
FitLinearModel39 <- readr::read_rds("my_39simple_model.rds")
FitLinearModel40 <- readr::read_rds("my_40simple_model.rds")
FitLinearModel41 <- readr::read_rds("my_41simple_model.rds")
fit_enet_42 <- readr::read_rds("my_42simple_model.rds")
fit_enet_43 <- readr::read_rds("my_43simple_model.rds")
fit_rf44 <- readr::read_rds("my_44simple_model.rds")
fit_svm45 <- readr::read_rds("my_45simple_model.rds")
fit_nnet_46 <- readr::read_rds("my_46simple_model.rds")
fit_xgb47 <- readr::read_rds("my_47simple_model.rds")
fit_xgb47T <- readr::read_rds("my_47Tsimple_model.rds")
fit_pca_nnet48 <- readr::read_rds("my_48simple_model.rds")
my_resultsVClassification <- resamples(list(LM_39 = FitLinearModel39,
                                            LM_40 = FitLinearModel40,
                                            SPLINE_41 = FitLinearModel41,
                                            ENET_42 = fit_enet_42,
                                            ENET_43 = fit_enet_43,
                                            RF44 = fit_rf44,
                                            SVM45 = fit_svm45,
                                            NNET_46 = fit_nnet_46,
                                            XGB47 = fit_xgb47,
                                            XGB47T = fit_xgb47T,
                                            PCA_NNET48 = fit_pca_nnet48))
dotplot(my_resultsVClassification)
varImp(fit_xgb47T, scale = FALSE)
## xgbTree variable importance
##
## only 20 most important variables shown (out of 41)
##
## Overall
## v02 0.207377
## v10 0.172733
## v08 0.167506
## v04 0.124351
## v12 0.082465
## v06 0.073852
## v01 0.028269
## v11 0.017813
## v07 0.013963
## v09 0.011146
## v35 0.011123
## v33 0.009749
## v40 0.009432
## v05 0.009270
## v36 0.007086
## v38 0.007018
## v03 0.006787
## v39 0.005753
## v41 0.003682
## v24 0.003341
plot(varImp(fit_xgb47T, scale = FALSE), top = 20)
splom(my_resultsVClassification)
densityplot(my_resultsVClassification)
bwplot(my_resultsVClassification, layout = c(3, 1))
Explain_VariablesV <- lime::lime(ready_v_B, fit_xgb47T, n_bins = 5)
Explanation of all input features for the 50th observation of the v-variable data, labeled for the event class
V_VariablesExplain <- lime::explain(
  x = ready_v_B[50, ],
  explainer = Explain_VariablesV,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 41,
  feature_select = "highest_weights",
  labels = "event"
)
lime::plot_features(V_VariablesExplain)
lime::plot_explanations(V_VariablesExplain)
Predictive visualization and explanation of all v-input features for the 50th observation, labeled for the non-event class
V_VariablesExplain_NonEvent <- lime::explain(
  x = ready_v_B[50, ],
  explainer = Explain_VariablesV,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 41,
  feature_select = "highest_weights",
  labels = "non_event"
)
lime::plot_features(V_VariablesExplain_NonEvent)
lime::plot_explanations(V_VariablesExplain_NonEvent)
In conclusion, the tuned XGBoost model is the best of the candidate models: it achieves the highest resampled ROC AUC, and its false-positive and false-negative rates are minimal when classifying the outcome.
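Since `my_metric` is "ROC", the model comparison above rests on the area under the ROC curve rather than raw accuracy. As a self-contained sketch (synthetic scores, not the project data), the AUC that caret's `twoClassSummary` reports is equivalent to the rank-based probability that a randomly chosen event outscores a randomly chosen non-event:

```r
# Rank-based (Mann-Whitney) AUC: P(score of a random event >
# score of a random non_event), the quantity caret labels "ROC".
auc_rank <- function(scores, labels) {
  r <- rank(scores)
  n_pos <- sum(labels == "event")
  n_neg <- sum(labels == "non_event")
  (sum(r[labels == "event"]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
set.seed(12)
labels <- rep(c("event", "non_event"), each = 50)
scores <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # events score higher
auc_rank(scores, labels)  # well above the 0.5 chance level
```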
#PART E, INTERPRETATION AND OPTIMIZATION
library(caret)
train_x <- readr::read_csv("train_input_set_x.csv", col_names = TRUE)
## Rows: 1899 Columns: 44
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (44): run_id, x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_v <- readr::read_csv("train_input_set_v.csv", col_names = TRUE)
## Rows: 1899 Columns: 42
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (42): run_id, v01, v02, v03, v04, v05, v06, v07, v08, v09, v10, v11, v12...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
train_outputs <- readr::read_csv("train_outputs.csv", col_names = TRUE)
## Rows: 1899 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): outcome
## dbl (2): run_id, response
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
ready_v_B <- train_v %>%
  left_join(train_outputs, by = 'run_id') %>%
  select(-run_id, -response) %>%
  mutate(outcome = factor(outcome, levels = c("event", "non_event")))
ready_v_B %>% glimpse()
## Rows: 1,899
## Columns: 42
## $ v01 <dbl> 0.60550075, 0.03512355, 0.79408378, 0.65352400, 0.21140873, 0.~
## $ v02 <dbl> 0.59176441, 0.05360364, 0.78632885, 0.49152356, 0.15821610, 0.~
## $ v03 <dbl> 0.99866539, 0.74442030, 0.19855145, 0.10180881, 0.16904275, 0.~
## $ v04 <dbl> 0.51914601, 0.08378037, 0.62221318, 0.49275614, 0.08277755, 0.~
## $ v05 <dbl> 0.98440052, 0.73504281, 0.27906224, 0.22789122, 0.14089563, 0.~
## $ v06 <dbl> 0.27761280, 0.17613618, 0.09989673, 0.38340545, 0.20285082, 0.~
## $ v07 <dbl> 0.72065775, 0.76673777, 0.52718254, 0.68876044, 0.18952144, 0.~
## $ v08 <dbl> 0.25497542, 0.21544020, 0.20828020, 0.37008161, 0.24295170, 0.~
## $ v09 <dbl> 0.74392608, 0.79567991, 0.53658142, 0.72114180, 0.17622231, 0.~
## $ v10 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.~
## $ v11 <dbl> 0.99715803, 0.90050345, 0.52073711, 0.74925165, 0.16071140, 0.~
## $ v12 <dbl> 0.25743069, 0.19289445, 0.38850662, 0.38366215, 0.22070718, 0.~
## $ v13 <dbl> 0.974760511, 0.973723716, 0.998414906, 0.698134370, 0.00899540~
## $ v14 <dbl> 0.6033220, 0.3995356, 0.2323312, 0.6210052, 0.1473009, 0.14393~
## $ v15 <dbl> 0.961458306, 0.994421267, 0.994419338, 0.480866206, 0.02262710~
## $ v16 <dbl> 0.5845183, 0.3929352, 0.2359107, 0.5664388, 0.1438076, 0.12691~
## $ v17 <dbl> 0.9913476989, 0.9990456588, 0.9982807708, 0.3484922944, 0.0552~
## $ v18 <dbl> 0.58758891, 0.46163856, 0.46932273, 0.66083964, 0.17190848, 0.~
## $ v19 <dbl> 0.992575596, 0.998326752, 0.997408082, 0.384090286, 0.07063359~
## $ v20 <dbl> 0.62007956, 0.43412785, 0.39534349, 0.57752233, 0.15817533, 0.~
## $ v21 <dbl> 0.995455551, 0.998267938, 0.999570601, 0.419018421, 0.04033756~
## $ v22 <dbl> 0.67064607, 0.41399999, 0.37593184, 0.53785166, 0.15708468, 0.~
## $ v23 <dbl> 0.96558251, 0.99883838, 0.40326323, 0.39321015, 0.06023979, 0.~
## $ v24 <dbl> 0.8126079, 0.3969788, 0.3570253, 0.6673296, 0.1941783, 0.23148~
## $ v25 <dbl> 0.82547824, 0.99934056, 0.83645894, 0.53876807, 0.02006501, 0.~
## $ v26 <dbl> 0.7898702, 0.4028064, 0.4261415, 0.7360598, 0.2085493, 0.23774~
## $ v27 <dbl> 0.72778006, 0.99861090, 0.83560109, 0.75246718, 0.02473616, 0.~
## $ v28 <dbl> 0.7754927, 0.4112850, 0.4358383, 0.7761393, 0.2282164, 0.25683~
## $ v29 <dbl> 0.99418321, 0.98082307, 0.82981961, 0.10035939, 0.01207357, 0.~
## $ v30 <dbl> 0.8260707, 0.4373017, 0.4776068, 0.8331016, 0.2544287, 0.26229~
## $ v31 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.00628922~
## $ v32 <dbl> 0.994932236, 0.995900801, 0.832793431, 0.176949599, 0.00628922~
## $ v33 <dbl> 0.743168039, 0.169432565, 0.990940250, 0.904737500, 0.96688922~
## $ v34 <dbl> 0.35529833, 0.17737909, 0.45506100, 0.49154722, 0.41835025, 0.~
## $ v35 <dbl> 0.68272602, 0.20804263, 0.98143915, 0.88001101, 0.95389492, 0.~
## $ v36 <dbl> 0.29133870, 0.15346219, 0.37971022, 0.51743062, 0.55531942, 0.~
## $ v37 <dbl> 0.601994815, 0.329438589, 0.984413079, 0.862959415, 0.95936678~
## $ v38 <dbl> 0.28843957, 0.15620445, 0.44257632, 0.57036343, 0.59601264, 0.~
## $ v39 <dbl> 0.572457748, 0.352100439, 0.988011127, 0.861117030, 0.96330864~
## $ v40 <dbl> 0.32414769, 0.17793792, 0.43088254, 0.57299163, 0.61438124, 0.~
## $ v41 <dbl> 0.57544185, 0.34392306, 0.98759446, 0.86143126, 0.96424137, 0.~
## $ outcome <fct> event, non_event, event, event, non_event, non_event, non_even~
Loading the top models for the x-variable and v-variable versions of the regression and classification problems
#Regression
#X-variable
fit_nnet_14 <- readr::read_rds("my_14simple_model.rds")
#V-variable
fit_nnet_24 <- readr::read_rds("my_24simple_model.rds")
#Classification
#X-variable
fit_xgb37T <- readr::read_rds("my_37Tsimple_model.rds")
#V-variable
fit_xgb47T <- readr::read_rds("my_47Tsimple_model.rds")
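The save-and-reload pattern used throughout relies on RDS serialization preserving the fitted object exactly. A minimal base-R sketch of the same round trip (toy object and a temporary file, not the actual model files):

```r
# Round trip: the reloaded object is identical to what was written,
# which is why earlier fits can be reused here without retraining.
obj <- list(coef = c(a = 1.5, b = -2), metric = "ROC")
path <- tempfile(fileext = ".rds")
saveRDS(obj, path)          # base-R counterpart of readr::write_rds()
reloaded <- readRDS(path)   # counterpart of readr::read_rds()
identical(obj, reloaded)    # TRUE
```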
fit_nnet_reg_vip14 <- fit_nnet_14 %>% varImp() %>% plot()
fit_nnet_reg_vip14
fit_nnet_reg_vip24 <- fit_nnet_24 %>% varImp() %>% plot()
fit_nnet_reg_vip24
fit_xgb37T_vip <- fit_xgb37T %>% varImp() %>% plot()
fit_xgb37T_vip
fit_xgb47T_vip <- fit_xgb47T %>% varImp() %>% plot()
fit_xgb47T_vip
holdout_x <- readr::read_csv("holdout_inputs_x.csv", col_names = TRUE)
## Rows: 300 Columns: 43
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (43): x01, x02, x03, x04, x05, x06, x07, x08, x09, x10, x11, x12, x13, x...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
holdout_v <- readr::read_csv("holdout_inputs_v.csv", col_names = TRUE)
## Rows: 300 Columns: 41
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (41): v01, v02, v03, v04, v05, v06, v07, v08, v09, v10, v11, v12, v13, v...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
y0 <- predict(fit_nnet_24, holdout_v)
predict(fit_xgb47T, holdout_v) %>% class()
## [1] "factor"
predict(fit_xgb47T, holdout_v, type = 'prob') %>% class()
## [1] "data.frame"
predict(fit_xgb47T, holdout_v, type = 'prob') %>% head()
## event non_event
## 1 0.096850328 0.9031496719
## 2 0.857257307 0.1427426934
## 3 0.002482155 0.9975178449
## 4 0.062169079 0.9378309213
## 5 0.941977799 0.0580222011
## 6 0.999349535 0.0006504655
predict(fit_nnet_14, holdout_x) %>% class()
## [1] "numeric"
predict(fit_xgb37T, holdout_x) %>% class()
## [1] "factor"
predict(fit_xgb37T, holdout_x, type = 'prob') %>% class()
## [1] "data.frame"
myR_Shiny_Prediction <- tibble::tibble(
  response = predict(fit_nnet_24, newdata = holdout_v),
  outcome = predict(fit_xgb47T, newdata = holdout_v)) %>%
  bind_cols(
    predict(fit_xgb47T, newdata = holdout_v, type = 'prob') %>%
      select(probability = event)) %>%
  tibble::rowid_to_column("id")
myR_Shiny_Prediction %>% readr::write_csv("myShinyResult.csv", col_names = TRUE)
Explain_VariablesV <- lime::lime(ready_v_B, fit_xgb47T, n_bins = 5)
V_VariablesExplain <- lime::explain(
  x = ready_v_B[50, ],
  explainer = Explain_VariablesV,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 6,
  feature_select = "highest_weights",
  labels = "event"
)
lime::plot_features(V_VariablesExplain)
lime::plot_explanations(V_VariablesExplain)
# Visualization of the top 6 variables for the non_event probability
V_VariablesExplain_NonEvent <- lime::explain(
  x = ready_v_B[50, ],
  explainer = Explain_VariablesV,
  n_permutations = 500,
  dist_fun = "gower",
  kernel_width = 0.75,
  n_features = 6,
  feature_select = "highest_weights",
  labels = "non_event"
)
lime::plot_features(V_VariablesExplain_NonEvent)
lime::plot_explanations(V_VariablesExplain_NonEvent)
# Objective function for optim(): returns the model's predicted response
# (regression) or predicted event probability (classification) for a
# candidate vector of the 41 v-inputs.
objective_func <- function(inputs, model, is_regression) {
  # Assemble the candidate values into a one-row data frame named v01..v41
  as_list <- as.data.frame(as.list(setNames(inputs, sprintf("v%02d", 1:41))))
  if (is_regression) {
    predict(model, as_list)
  } else {
    predict(model, as_list, type = 'prob')$event
  }
}
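Before pointing `optim()` at the fitted models, the call pattern can be checked on a toy objective (a sketch with a made-up quadratic, not the project models): extra arguments after `gr` are forwarded to the objective function, and `L-BFGS-B` honors the box constraints.

```r
# Minimize sum((inputs - target)^2) subject to 0 <= inputs <= 1;
# `target` is forwarded through ... exactly like `model` above.
toy_objective <- function(inputs, target) sum((inputs - target)^2)
toy_fit <- optim(
  rep(0.5, 5),                 # starting guess, as in the real calls
  toy_objective,
  gr = NULL,
  target = rep(0.2, 5),
  method = "L-BFGS-B",
  lower = rep(0, 5),
  upper = rep(1, 5),
  control = list(fnscale = 1)  # positive fnscale -> minimization
)
round(toy_fit$par, 3)  # each element near the 0.2 target
```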
guess <- rep(0.5, ncol(holdout_v))  # one starting value per v-input
nnet_reg_fit_a <- optim(
  guess,
  objective_func,
  gr = NULL,
  fit_nnet_24,        # forwarded to objective_func as `model`
  TRUE,               # is_regression: minimize the predicted response
  method = "L-BFGS-B",
  hessian = FALSE,
  lower = rep(0, 41),
  upper = c(rep(1, 4), rep(Inf, 37)),
  control = list(fnscale = 1)
)
nnet_reg_fit_a$par
## [1] 0.000000000 0.435278676 1.000000000 0.008485782 0.046133616 5.321876012
## [7] 0.056167658 0.008076615 0.051298181 3.263456289 0.088463376 3.469176975
## [13] 0.771225161 0.000000000 0.917604152 0.000000000 0.303225075 0.861695726
## [19] 0.363028536 0.000000000 0.189554096 0.360027720 0.000000000 0.001210091
## [25] 1.828243296 0.000000000 0.090567760 0.143188848 1.740344577 0.413529621
## [31] 1.449288037 1.504820838 0.587805810 3.129450989 2.611825473 1.730481843
## [37] 2.728389659 0.450076574 1.835975066 0.651443445 1.230828627
plot(nnet_reg_fit_a$par)
guess <- rep(0.5, ncol(holdout_v))  # one starting value per v-input
xgb_class_fit47T <- optim(
  guess,
  objective_func,
  gr = NULL,
  fit_xgb47T,         # forwarded to objective_func as `model`
  FALSE,              # is_regression = FALSE: minimize the event probability
  method = "L-BFGS-B",
  hessian = FALSE,
  lower = rep(0, 41),
  upper = c(rep(1, 4), rep(Inf, 37)),
  control = list(fnscale = 1)
)
xgb_class_fit47T$par
## [1] 0.5000000 0.5000000 0.5000000 0.4114871 0.5000000 0.5000000 0.5000000
## [8] 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## [15] 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## [22] 0.5000000 0.5000000 0.5000000 0.4904335 0.5000000 0.5000000 0.5000000
## [29] 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
## [36] 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000 0.5000000
plot(xgb_class_fit47T$par)
As seen from the optimization results above, for the binary classification problem only the v04 and v25 inputs move away from their 0.5 starting values when optim minimizes the event probability.
For the continuous output, the optimizer instead drives v01, v14, v16, v20, v23, and v26 to their lower bound of zero when minimizing the response.
Thus more v-variables are involved in minimizing the continuous response (v01, v14, v16, v20, v23, and v26) than in minimizing the event probability (only v04 and v25).
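The identification of "moved" inputs can also be made explicit rather than read off the plot. A short sketch (hypothetical numbers standing in for the actual `optim()` output):

```r
# Flag which optimized inputs moved away from the 0.5 starting guess;
# par_hat here is illustrative, not the real xgb_class_fit47T$par.
start <- rep(0.5, 6)
par_hat <- c(0.50, 0.41, 0.50, 0.50, 0.49, 0.50)
moved <- which(abs(par_hat - start) > 1e-6)
moved  # positions whose values the optimizer changed
```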